An Introduction to "Avian Genomics 
in Ecology and Evolution: From the Lab into 
the Wild" 

Robert H. S. Kraus 


Abstract 

Recently, the use of next-generation DNA sequencing developed from a technology
that only industry could use into a common tool in nonmodel biology. Not
long ago, ecologists and evolutionary biologists would rather shy away from 
taking on a project in which whole-genome sequencing would be the major tool. 
However, much sooner than anticipated, a new generation of students had been 
trained in the use of bioinformatics tools and concepts and they would leverage 
genomic technologies to their full potential—not only in medical sciences but 
also in ecology and evolution. Tumbling prices on the sequencing
market allowed the large ornithological community to adopt this technology
early. Birds can be seen as a model group among many other organismal groups. 
Substantial efforts of the community have built a foundation of avian genomics in 
ecology and evolution. In this book, we summarize the major advances that now 
constitute the first steps taken towards making whole-genome analyses commonplace.
In ten peer-reviewed chapters (in addition to this introduction) we provide
an overview of the use of genome technologies in avian biology research, especially
for an audience that might not currently be part of the “genomics revolution.”
We thereby aim to mediate between early adopters of avian genomics and
interested professionals. 
1 Introduction 

The recent rise of next-generation DNA sequencing technologies has transformed 
biological and medical research (Shendure et al. 2017). In a single decade,
so-called second-generation sequencing technologies have become established, and
third-generation sequencers are now routinely used as well. While the former offered
orders of magnitude of increased throughput at the cost of accuracy and length of the 
output DNA sequences, the latter revolutionized the field, resulting in very long 
DNA sequences. Third-generation sequencing was advocated as the solution to the 
“short read era” of second-generation sequencing (Munroe and Harris 2010), but it 
took many more years to become stable, sufficiently cheap (it is still more expensive 
than second-generation sequencing), and established (Lee et al. 2016). Most 
recently, a relatively old idea of pulling DNA strands through genetically engineered 
nanopores while measuring changes in electric potential (Ashkenasy et al. 2005; 
Deamer and Akeson 2000) hit the market and now pushes the boundaries of what is 
possible in terms of the lengths of a DNA strand that can be sequenced in one piece 
(Ashton et al. 2015; Goodwin et al. 2015; Madoui et al. 2015). A profound impact on 
the molecular sciences is expected, in particular for medical applications—from 
cancer research to genetic disease, from personalized medicine to genome-wide 
association studies via population re-sequencing. But what does all of this have to 
do with birds? 

Birds have been important for biological research disciplines throughout the 
history of biology. They attract members of the general public like no other group 
of animals; and birders—people with a personal interest in watching and enjoying 
the life of birds—are perhaps the largest nonprofessional naturalist community. This 
may well be due to an omnipresence of birds across biomes and habitats, and the ease 
with which they can be observed everywhere: in cities, in the countryside, in forests, 
in high mountain ranges, or on the coast. They are also the most speciose class 
among terrestrial vertebrates, with around 10,500 extant species (varying depending 
on the checklist of choice). The richness of forms triggered the curiosity of many 
early naturalists who later emerged as leading personalities for biology as a whole. 
For instance, the theory of evolution proposed by Darwin (1859) was greatly
inspired by his studies on the intriguing diversity of ground finches, later called
“Darwin’s finches,” on his 1835 travels to the Galapagos Islands (Lack 1947).

In addition, birds play an important role in medical and pharmaceutical research 
(e.g., eggs used as bioreactors), and they are an important source of human food, i.e., 
meat and eggs (Petitte and Mozdziak 2007). Chickens, for example, are bred 
industrially, with a global population exceeding that of humans by more than 
two-fold. Therefore, among the first genomes of vertebrates to be fully sequenced 
after the Human Genome Project (International Human Genome Sequencing Consortium
2001) was indeed the chicken (Hillier et al. 2004). Genomics in birds is a
major contributor to the general knowledge base of all biological fields (reviewed in 
Jax et al. 2018; Kraus and Wink 2015). 

Many avian studies adopted molecular methods very early on, even before the 
onset of the first generation of DNA sequencing (Sanger 1981; Sanger et al. 1977) 
and polymerase chain reaction (PCR; Saiki et al. 1988, 1985). Michael Wink 
reviews the early adoption of molecular methods, particularly in avian systematics, 
from a historical perspective and looks into the origins of genomics in avian research 
and where it might go (Wink 2019). Next-generation DNA sequencing has 
proliferated in applications in ecology and evolution studies too, beyond the core 
technological and medical applications, and several biological disciplines have been 
transformed into big data sciences, with all the accompanying benefits but also costs 
(Ekblom and Wolf 2014). Although the above-mentioned early avian model species 
such as chicken or zebra finch have been studied intensively, many other species 
have become models themselves. When the zebra finch genome was sequenced
(Warren et al. 2010), it did not take long to announce that a new era of nonmodel 
avian genomics had started (Balakrishnan et al. 2010). Obviously, this was not that 
straightforward because not every type of knowledge can be transferred that easily 
between species. Several large-scale bird sequencing projects were carried out over 
the following years to provide reference genome assemblies for a broad taxonomic 
range of birds (Zhang et al. 2014a). However, comparing research groups that have
a tradition in genomics with those emerging from the sequencing revolution reveals
the great value that established model species have when embarking on deciphering
genomes of nonmodel species. Without a detailed understanding of
the chicken genome and a comprehensive suite of wet and dry laboratory 
technologies pioneered by the chicken genetics and genomics community, most of 
today’s nonmodel genomics would not be possible. In this book, Vignal and Eory 
(2019) look into the question of whether or not advances in sequencing technologies 
today indicate the advancing obsolescence of the classical bird model. 

One of the earliest fields of genome-scale studies was to microscopically examine 
chromosomes. Combined with genetics, this discipline is called cytogenetics and has 
contributed greatly to our knowledge of genome organization across the avian tree of 
life. Damas et al. (2019) explain the contributions of early and modern cytogenetic
research that today go hand in hand with DNA sequencing technologies. Closely 
related to the large-scale organization of genomes at the chromosomal and the 
sub-chromosomal levels, and highly relevant for questions regarding genome assembly,
are repetitive regions. In early work in the human genome consortia, one of the
major findings was that large portions of the vertebrate genome are not only 
noncoding but also highly repetitive (International Human Genome Sequencing 
Consortium 2001). Genome scientists have called these regions “genomic dark 
matter” and Weissensteiner and Suh (2019) light the way for us with an extensive 
overview of the known repeat structures in the genomes of birds and what their roles
might potentially be. 

The outcome of evolution is reflected in the tree of life. Forms arise, change, and 
vanish. The paleontological record clearly shows this diversity of forms across the 
eons. Biologists have been fascinated for decades and centuries with deciphering 
phylogenetic relationships and building trees of speciation histories. The first group 
of animals to be sequenced for their whole genome and in a phylogenomic context 
was indeed birds (Jarvis et al. 2014). One representative of every order of birds 
underwent genome sequencing and was comparatively analyzed (Zhang et al. 
2014b, c). However, this was just the start of a new era of phylogenomics and 
comparative genomics across all the bird species in the world (Zhang 2015); an 
avian-specific project that serves as a blueprint for future studies in other groups of 
organisms (Jarvis 2016). With genome-scale data sets, the field of phylogeny has 
encountered significant challenges that need to be well balanced against the wealth 
of information that accompanies whole-genome data (Delsuc et al. 2005). There are 
not only technical hurdles such as computational complexities of big data but also 
difficult choices such as which parts of the genome should be used for inference, and 
in what way can the core questions be addressed using the large amount of avian 
genome data that are available. In this book, Braun et al. (2019) cover many aspects 
of why phylogenomics is difficult, and how problems can be overcome. Looking at 
the branches of a phylogenetic tree reveals a problem of its own: what is a species? 
Generations of biologists have had long-standing arguments on this question and the 
latest genomic techniques have not solved the puzzle. In this book, we deal with the 
issue in detail in a chapter on species concepts, with specific reference to avian 
species (Ottenburghs 2019). 

Processes at the level of the population have been routinely studied in ecology 
and evolution for decades. An extensive body of literature has accumulated on how 
geography and climates of the past have shaped the population structure of today. 
These phylogeographic studies make use of population genetic theory and with the 
advent of genome-scale data they deliver ever finer and more complex models and 
predictions (Ottenburghs et al. 2019). Population studies on recent time scales, in 
particular incorporating detailed knowledge of pedigree and other related information
within populations, also make quantitative predictions. Husby et al. (2019)
show how combined knowledge and technology of classical genetics and genomics 
from animal breeding model species (e.g., the chicken) and genomic studies of wild 
populations advance our understanding of trait evolution on short time scales
(microevolution). Understanding the background of avian phenotypes and the heredity of
traits enriches the tool box of conservation managers. In addition, there is an obvious 
and omnipresent discussion on the role of genetic and genomic diversity in species’ 
abilities to cope with environmental change. Cassin-Sackett et al. (2019) look at this 
and at the role that genomics can play versus more traditional genetic technology. 

Our book closes with a special notion uniting the historical and paleontological 
aspects of avian research with the modern approaches of cytogenetics and comparative
genomics. Both morphological and genomic evidence have clearly shown that
birds are in fact dinosaurs, and Griffin et al. (2019) take you on a journey through 
Jurassic Park to appreciate this curious thought to its fullest potential. 

Our team of authors thanks you, the reader, for taking the time to delve into the 
breadth and depth of avian genomics. We appreciate that some parts may be too 
detailed for a certain audience and others too vague. We aimed to strike a balance by 
addressing a nongenomics audience but are aware that this can never work perfectly. 
Our aim was to get as close to this idealistic and challenging goal as we possibly 
could. The chapters of this book were peer reviewed by blinded independent experts 
in their fields to make sure that no important details are missing and that no chapter 
becomes too obfuscated with specialist terminology. The various aspects of 
genomics, whether technical or conceptual, require some time for familiarization. 
We hope that our texts contribute to this process and provide starting points for 
digging deeper where necessary and/or desired. 
A Historical Perspective of Avian Genomics 


Michael Wink 


Abstract 

A traditional aim of avian taxonomists and systematists was to establish a reliable 
phylogenetic framework, the avian tree of life. Until 50 years ago, the only way to 
establish systematic relationships relied on the comparison of morphological 
characters, which could be misleading because of convergent character evolution. 
The first molecular approach used the electrophoretic separation of proteins (from 
eggs). This was followed in the 1980s by DNA-DNA hybridisation. Both 
methods provided some insight but did not show sufficient resolution. Better 
results were obtained from nucleotide sequencing of marker genes, which started 
in the 1990s. A real breakthrough came with next-generation sequencing (NGS), 
which allowed sequencing a large portion of the avian genome. The review 
illustrates and briefly discusses the achievements in the past and limitations of 
the different methodological approaches. 
Systematics • Taxonomy • Classification • Morphology • Phylogenetics • Genome • 
DNA-DNA hybridisation • Sequencing • Next-generation sequencing 


1 Ornithology: From Aristotle to Linné

Today’s sciences are shaped mainly by Greek/Latin history that has strongly 
influenced Western European scholars. I am aware that naturalists around the 
world have studied birds and quite scientifically so. But, in this overview, I have 
focussed mainly on Western science. 

The Greek philosopher and naturalist Aristotle (384-322 BC) already recognised
140 bird species and described their morphology, anatomy, behaviour, distribution
and biology. The Roman naturalist Plinius (23-79 AD) produced a Historia
naturalis in 37 volumes and ordered birds according to the anatomy of their legs
and feet. Aristotle and Plinius remained the authorities for more than 1000 years, and
most later bird descriptions [e.g. from Albertus Magnus (1193-1280)] were copied
from their books. The Holy Roman Emperor Frederick II (1194-1250) was an
exception insofar as he used his own observations for his book on falconry De
arte venandi cum avibus (Stresemann 1951). 

The Renaissance (from the fifteenth century onwards) was important for the 
development of natural sciences; it also brought progress for ornithology. The 
invention of book printing by Johannes Gutenberg around 1450 facilitated the 
rapid dissemination of novel findings and the publication of illustrated animal and 
plant books. The following three authors were especially important for the progress 
in ornithology: William Turner (1500-1568), Pierre Belon (1517-1564) and Conrad 
Gessner (1516-1565). In the illustrated Historia animalium , Gessner described 
180 bird species. His knowledge was derived from his own observations and from
those of other experts. After several reprints of Gessner’s popular work, John
Jonston (1603-1675) produced the Historiae naturalis de avibus libri VII in 1650 
with excellent illustrations from Matthaeus Merian (Fig. 1). 



Fig. 1 Illustrations from Historiae naturalis de avibus libri VII, in which species were ordered 
according to similarity. As can be seen, shrikes and cuckoos were grouped with raptors because of their
bill morphology (a) and even bats were classified as birds because of their wings (b) (Photo M. Wink) 

Progress in ornithology also profited from the expeditions into foreign countries
and continents, which started around 1500. Explorers brought back live birds, which
were kept in aviaries (which had already existed many hundreds of years earlier), from
which predecessors of zoos were later derived.

Based on the early collections, naturalists started to elaborate on a taxonomy of 
birds after 1600, which largely replaced the concepts of Aristotle and Plinius. 
Influential bird specialists include Walter Charleton (1619-1707), Francis 
Willughby (1635-1672) and especially John Ray (1628-1704). The Ornithologiae 
libri tres of Ray can be regarded as the first modern ornithology handbook. Ray and
other naturalists based it on systematic observations (Stresemann 1951). 

“An historical review of bird taxidermy in Britain” cites René Antoine
Ferchault de Réaumur (1748) as saying that the whole progress of ornithology was
being retarded because of preservation problems. More detailed studies on
preservation date from the early 1770s (at least in Britain). With the development
of bird taxidermy (Schulze-Hagen et al. 2003), exotic bird skins were bought by
private collectors and were later sold or donated to the public. The need to make
these collections widely available led to the opening of the natural history museums
that we have today.

The Swedish naturalist and medical doctor Carl Linnaeus (1707-1778)
introduced a binary nomenclature, which allowed the unequivocal description of
animal and plant species. For example, the European blackbird was named Turdus
merula; the first name corresponds to the genus and the second to the species. No
other animal will carry this name. Linné tried to introduce systematics in order to
arrange similar species into genera, orders and classes. Linné distinguished 6 bird
orders with 85 genera, which were defined by the morphology of beaks and feet.
Unrelated organisms living under similar constraints often evolve similar
adaptations, which are called convergent traits.
As there are many convergent developments in these characters, some of the
systematic affiliations were wrong. The following six orders were recognised in
the tenth edition of Systema Naturae (1758):

1. Accipitres: raptors, owls, parrots, shrikes and waxwings

2. Picae: woodpeckers, hornbills, cuckoos, crows, hoopoes, birds of paradise and
creepers

3. Anseres: all water birds, pelicans, loons, grebes, gulls, terns and cormorants

4. Grallae: waders, flamingos, storks, herons, cranes, coots, bustards and ratites

5. Gallinae: wild fowl, guans, grouse and quails

6. Passeres: pigeons, thrushes, larks, hummingbirds, crossbills, wagtails, tits, swifts
and nightjars

The binary concept of Linné was promoted in Germany and Europe during the
eighteenth century by E. Pontoppidan, M. T. Brünnich, G. A. Scopoli, T. Pennant,
M. Tunstall, Statius Müller, P. S. Pallas, M. Brisson and J. F. Gmelin. In England the
binary system of Linné was accepted at the end of the eighteenth century, whereas in
France the zoologist G.-F. Buffon (1707-1788) developed his own system in his 
Histoire Naturelle des Oiseaux (1770) (Stresemann 1951). 

Avian taxonomy and systematics rapidly progressed during the eighteenth and
nineteenth centuries. As many explorers set out to travel and to explore Asia, Africa,
Australia and the Americas, many new bird species were found and described.
Scientifically curated bird collections were established during this time in Paris (1793),
London (1881), Frankfurt, Munich, Dresden and Halle, which allowed a better
understanding of systematics. This was a period when questions concerning the 
status of species and subspecies were extensively discussed. Important ornithologists 
of this period were P.S. Pallas (1741-1811), J.R. Forster (1728-1798), G. Forster 
(1754-1794), F. Levaillant (1753-1824), C. Illiger (1775-1813), J. Kaup 
(1803-1873), J.F. Naumann (1780-1857), C.L. Brehm (1787-1864), J.H. Blasius 
(1809-1870), G. Hartlaub (1814-1900), J. Cabanis (1816-1906), A. Reichenow 
(1847-1914), M. Fürbringer (1846-1920), H. Gadow (1855-1928) and
O. Kleinschmidt (1870-1954) in Germany; J. Latham (1740-1837), N. Vigors
(1785-1836) (who already knew 3125 bird species), P.L. Sclater (1829-1913),
T.H. Huxley (1825-1895), R. Swinhoe (1836-1877), G.R. Gray (1808-1872), 
E. Hartert (1859-1933), R.B. Sharpe (1847-1909) and W. von Rothschild 
(1868-1937) in Great Britain; C.J. Temminck (1778-1858), H. Schlegel 
(1804-1884), O. Finsch (1838-1917) and H. Boie (1784-1827) in the Netherlands; 
J.B. de Lamarck (1744-1829) and G. Cuvier (1769-1832) in France; and J. Bartram 
(1699-1777), C.W. Peale (1741-1827), A. Wilson (1766-1813), J.J. Audubon 
(1785-1851), C.L. Bonaparte (1803-1857), J. Cassin (1813-1869), W. Swainson 
(1789-1855), R. Ridgway (1850-1929) and J.A. Allen (1838-1921) in North 
America (Stresemann 1951; Walters 2003). 

Avian systematics saw many changes in the eighteenth and nineteenth centuries.
The discussion on the taxonomy of birds changed when Charles Darwin
(1809-1882) and A.R. Wallace (1823-1913) introduced the theory of evolution
through natural selection and the concept of phylogeny. This theory allowed an
understanding of how adaptations in morphology, behaviour and other traits could
evolve via natural selection. In addition, the idea of phylogenetic trees and the
descent from common ancestors was elaborated by C. Darwin. These were revolutionary
new concepts, since many people believed at that time that God had created
each species separately and that species do not change.


2 Towards a Natural System of Birds in the Twentieth 
Century 

Ornithologists after Darwin tried to establish a “natural system” of birds, which 
would reflect a common phylogeny. More than 40 different systems have been 
proposed during the last 200 years (Walters 2003). In the twentieth century, several 
systems were introduced, which were used to arrange the order of bird families in 
handbooks and field guides (e.g. Gadow 1892; Wetmore 1930-1960; Peters 1931-1986;
Stresemann 1927-1934; Mayr and Amadon 1951; Wolters 1982; Sibley and
Monroe 1990; Dickinson 2003; Clements 2007). A summary of some of these 
classification systems can be found in Sibley and Ahlquist (1990). 

Taxonomy and systematics were mainly based on morphological characters, such
as plumage colouration and shapes of beak, head and feet but also anatomical details.
In the twentieth century, other biological, biogeographical, ecological and biochemical
characters were additionally studied. The main concept of classifying species in
all these systems was overall similarity of characters. The more similar, the more
closely related two taxa were assumed to be. In most instances, avian taxonomists
agreed on the inclusion of a group of similar taxa into a common genus.
The attribution to families and orders was sometimes more difficult. Difficulties
could arise because in many birds, plumage can change according to sex, age and
season. Only when large collections became available was it possible to place all
the different plumage forms into a single taxon. Furthermore, a number of bird
species can hybridise; thus hybrids might be wrongly classified in bird collections.
As mentioned before, many characters are adaptive and can undergo convergent
evolution. Thus, similar characters could evolve in unrelated groups. Placing them
together in a common assemblage would create artificial and polyphyletic groups
(groups which contain members from different lineages). Taxonomy and classification
also depend on species concepts, which have also changed over the last 200 years. Avian
species concepts and speciation processes will be discussed in detail in Ottenburghs 
(2019) and Cassin-Sackett et al. (2019). 

An important school of taxonomy, cladistics, was founded by Willi Hennig 
(1913-1976) in the 1950s. He introduced plesiomorphic, apomorphic and 
synapomorphic traits to define groups of organisms of common ancestry. Groups
which included all descendants from a common ancestor were called “monophyletic”.
If taxonomists follow the rules of cladistics, they can distinguish between
monophyletic, paraphyletic and polyphyletic groups. Only monophyletic groups 
should be the base of a natural system of classification. Whenever para- and 
polyphyletic groupings are discovered, a taxonomic revision becomes necessary. 
Especially since the introduction of DNA analyses (which allowed the detection of
non-monophyletic assemblages), many revisions at the genus and family level have
become necessary (Wink 2011, 2014) (see Ottenburghs 2019; Cassin-Sackett et al.
2019). 
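
The test for monophyly described above can be made concrete with a small sketch. The Python example below is only an illustration and not a method taken from the cited literature: it assumes a rooted tree written as nested tuples, and the helper names (leaves, clades, is_monophyletic) as well as the six-taxon toy tree are hypothetical. It only checks whether a set of taxa corresponds exactly to one clade of the tree; it does not separate paraphyletic from polyphyletic groups.

```python
def leaves(tree):
    """Return the set of taxon names found below a node of the tree."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every clade (every node) in the tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the group consists of exactly the leaves of one clade."""
    return frozenset(group) in set(clades(tree))


# Hypothetical toy tree, used for illustration only.
toy_tree = (("ostrich", "tinamou"), (("chicken", "duck"), ("hawk", "sparrow")))

print(is_monophyletic(toy_tree, {"chicken", "duck"}))  # a complete clade
print(is_monophyletic(toy_tree, {"duck", "hawk"}))     # members of two clades
```

A grouping that fails this test would, under the rules of cladistics, be a candidate for taxonomic revision.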


3 Systematics and Phylogeny in the Age of DNA Analysis

In all areas of biology, including ornithology, a new era came with the study of DNA 
and genetics. The structure of DNA was elucidated in 1953 by James Watson and
Francis Crick. Progress in the field of genetics depended on new technologies: DNA
sequencing was established in 1978, polymerase chain reaction (PCR) in 1985 and
next-generation sequencing (NGS) after 2000, which allows the parallel sequencing
of thousands of DNA sequences (see Weissensteiner and Suh 2019; Braun et al. 
2019; Ottenburghs et al. 2019; Husby et al. 2019; Griffin et al. 2019). 

DNA is present in the cells of all living organisms, from simple bacteria to Homo 
sapiens. Before any cell division, DNA is copied, so that a daughter cell obtains an
almost identical copy of the mother cell’s DNA. Since all cells originate from a mother cell
through cell division, all cells which exist in all individuals on this planet today
must have been derived from common ancestral cells. Thus, we have a continuous 
lineage of cells and DNA back to the origin of life. DNA undergoes a few mutations 
during the replication process or through mutagenic agents (chemicals, radiation) 
and spontaneous chemical hydrolysis of nucleotide bases. These mutations are 
preserved in specific positions in the genomes, if they are not corrected by repair 
enzymes or not lethal for the individual. If these mutations occur in the germline and 
are transferred to the next generation, they will (in general) continue further on in 
subsequent generations and over time can become fixed in the population. As a
consequence, all individuals living today carry DNA with millions of
changes in their genomes which distinguish them from their ancestors and from other
species. Consequently, millions of single nucleotide polymorphisms (SNPs),
insertions and deletions, as well as other larger-scale mutations, exist. The analysis
of these mutations enables phylogeneticists to reconstruct the common descent of all
organisms and to infer a reliable “tree of life”. Mutations usually occur at random
and accumulate at roughly similar rates. Therefore, the concept of a molecular
clock can be used to date ancient divergence events (Thorpe 1982) (see Braun et al. 
2019; Weissensteiner and Suh 2019; Griffin et al. 2019). 
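
The arithmetic behind a strict molecular clock can be sketched in a few lines. The Python example below is a simplified illustration under stated assumptions, not the dating approach of any study cited here: it uses the uncorrected proportion of differing sites (p-distance) between two aligned sequences, assumes a constant substitution rate per lineage, and estimates the time since divergence as t = d / (2r), where the factor 2 reflects that both lineages accumulate changes independently. The function names, the short sequences and the rate value are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal in length and non-empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def divergence_time_myr(distance: float, rate_per_lineage_per_myr: float) -> float:
    """Time since divergence in million years under a strict clock, t = d / (2r)."""
    if rate_per_lineage_per_myr <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_lineage_per_myr)


# Hypothetical input: two short aligned fragments and an assumed rate of
# 0.01 substitutions per site per lineage per million years.
d = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(d, divergence_time_myr(d, 0.01))
```

Real analyses correct for multiple substitutions at the same site and allow rates to vary among lineages, so this sketch only conveys the basic idea.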

As in other animals, avian DNA is present in the nucleus of a cell arranged in 
chromosomes. Most birds have more chromosomes than most mammals: For example,
the diploid genome of a heron consists of 68 chromosomes, while many other
bird taxa have 80 or more chromosomes. In general, 25% of these chromosomes
are categorised as large macrochromosomes (>40 Mbp), while the others are tiny 
microchromosomes (<20 Mbp), which undergo a high degree of recombination (see 
Damas et al. 2019). The sex chromosomes of birds are different from mammals in 
the sense that the females are the heterogametic sex. Females are characterised by 
WZ and males by ZZ gonosomes (as opposed to the mammalian XX and XY, 
respectively) (see Damas et al. 2019). The bird genomes are also smaller than the 
average mammalian genomes and usually contain over 1 billion nucleotides in the 
haploid genomes and encode about 20,000 protein-coding genes, similar to 
mammals. The content of repetitive DNA is lower in birds than in mammals 
(Weissensteiner and Suh 2019; Kraus and Wink 2015; Zhang et al. 2014). 

In addition to nuclear DNA, eukaryotic cells carry 100 to 1000 mitochondria 
(organelles important for energy metabolism) which carry their own circular and 
maternally inherited DNA (mtDNA). mtDNA is rather small and consists of
16,000-19,000 nucleotides, which encode 13 proteins, 22 tRNAs and
2 rRNAs. Mitochondrial DNA is inherited maternally, as the oocyte has many
mitochondria and the spermatocyte only a few. Upon fertilisation, only the maternal
mitochondria are maintained. When mtDNA is studied alone, hybridisation can distort the
phylogenetic interpretation. As mtDNA is more variable than protein-coding nuclear 
DNA, the sequencing and comparison of mtDNA have become an important tool to 
study bird taxonomy and systematics (Avise 1987; Shields and Helm-Bychowski 
1988; Funk and Omland 2003) (see Braun et al. 2019; Ottenburghs et al. 2019). 

DNA can be recovered from any bird tissue but is also present in feathers or in 
buccal swabs. Using PCR, marker genes (such as mitochondrial cytochrome b, COI, 
ND2) can easily be amplified and sequenced using the chain termination method of 
Sanger. A wider set of genes or even complete genomes and transcriptomes can be 
sequenced simultaneously using NGS methodology (Kraus and Wink 2015), as will 
be outlined in later chapters (see Weissensteiner and Suh 2019; Braun et al. 2019; 
Ottenburghs et al. 2019; Husby et al. 2019; Griffin et al. 2019). 


4 DNA-DNA Hybridisation 

Even before DNA information could be read letter by letter using emerging
sequencing technologies, isolated DNA had its use in the study of bird systematics. 
Charles Sibley was the first scientist to use DNA analysis for the study of bird 
systematics. Before this, he had tried to use the analysis of proteins from eggs 
(Sibley 1960; Sibley et al. 1974). However, the information gained from electrophoretic
profiles of proteins was too limited to be of wider use in taxonomy. As a new
endeavour, he started his DNA work in 1975 when DNA sequencing was still in its 
infancy. Instead of sequencing, Sibley used DNA-DNA hybridisation. The two 
complementary strands of the DNA double helix can be separated, for example, 
by heating. When the temperature is lowered again, the complementary DNA
strands will find each other and form a double helix, a process called
reassociation. Depending on the abundance of GC pairs, the thermal stability of
DNA can differ. GC-rich DNA melts at a higher temperature than GC-poor DNA. 
The melting temperature can be reliably and accurately measured. Sibley used the 
concept of DNA-DNA hybridisation, discovered by Doty et al. (1960) and Marmur 
and Doty (1961). Briefly, as DNA molecules from different species differ in their GC
content, their melting temperatures should also be different. The more
closely related two species are, the more similar their melting temperatures will be. A
thorough description of DNA-DNA hybridisation is provided by Sibley and 
Ahlquist (1990), with details on DNA isolation, preparation of single-copy tracer 
DNA, hydroxyapatite (HAP) chromatography to generate single-copy DNA, radioactive
labelling and reassociation kinetics. The authors also provide details on data
processing, data corrections, controls and tree reconstructions by UPGMA clustering 
(Sibley and Ahlquist 1990). 
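
The UPGMA clustering step mentioned above can be illustrated with a minimal sketch. It is not the procedure of Sibley and Ahlquist (1990) itself: the function name `upgma`, the four unnamed taxa and their pairwise distances are all hypothetical and serve only to show how the closest clusters are joined repeatedly and how distances to a merged cluster are averaged.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    sizes = {name: 1 for name in names}   # cluster -> number of taxa it holds
    d = dict(dist)                        # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # The distance from the new cluster to each remaining cluster is the
        # mean over all member taxa, hence the weighting by cluster size.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Hypothetical pairwise distances for four unnamed taxa (not measured values).
taxa = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma(taxa, distances)
print(tree)
```

In this toy input, A and B are joined first, C is attached next and D last. UPGMA assumes equal rates of change along all lineages, which is one reason why trees built this way can differ from sequence-based phylogenies.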

In an extraordinary study, Charles Sibley together with Jon Ahlquist analysed the 
DNA melting curves of more than 1700 bird taxa. The results of their studies were 
published in 1990 as the Phylogeny and Classification of Birds. Sibley used this 
information to create a new avian systematics, which was published by Sibley and 
Monroe in 1990 as the Distribution and Taxonomy of Birds of the World. The 
authors used the following ordering system, starting with the class Aves; they 
distinguished the categories subclass Neornithes (with the infraclasses Eoaves and 
Neoaves) and parvclasses, orders (with suborders and infraorders), superfamilies and 
families, tribes and genera. 


5 Comparison of the Sibley Phylogeny with NGS Data (See 
Braun et al. 2019) 

The infraclass Eoaves is basal and includes the Struthioniformes and Tinamiformes. 
This grouping is also recognised in more recent studies (Hackett et al. 2008; Jarvis 
et al. 2014; Prum et al. 2015), which are based on partial or full-genome DNA 
sequence analyses and in which it is termed “Palaeognathae” (Fig. 2). The rest of the birds are 
recognised as Neognathae in modern phylogenies, a group that was termed Neoaves by 
Sibley and Monroe (1990). However, the Neoaves of current phylogenies does not 
include the Galloanserae, which were recognised as the parvclass Galloanserae (with 
the orders Craciformes, Galliformes and Anseriformes) by Sibley and Monroe in 
1990. Overall, the structure of the genomic avian trees of life (based on 
genomic sequences; Hackett et al. 2008; Jarvis et al. 2014; Prum et al. 2015) is 
similar but not identical to that of Sibley and Monroe (1990). 

Sibley and Monroe (1990) recognised five more parvclasses: the parvclass 
Turnicae (with the order Turniciformes), the parvclass Picae (with the order 
Piciformes), the parvclass Coraciae (with the orders Galbuliformes, Bucerotiformes, 
Upupiformes, Trogoniformes, Coraciiformes), the parvclass Coliae (with the order 
Coliiformes) and the large parvclass Passerae (with the orders Cuculiformes, 
Psittaciformes, Apodiformes, Trochiliformes, Musophagiformes, Strigiformes, 
Columbiformes, Gruiformes, Ciconiiformes and Passeriformes). With regard to 
DNA sequence data (Hackett et al. 2008; Jarvis et al. 2014; Prum et al. 2015), strong 
discrepancies have been discovered (see Braun et al. 2019). For example, the large 
order Ciconiiformes in Sibley’s systematics consisted of the suborders Charadrii 
[including sandgrouse (Pteroclidae), waders, gulls and auks] and Ciconii (including 
raptors, falcons, grebes, storks, herons, flamingos, penguins, loons and shearwaters). 
In DNA sequence phylogenies, sandgrouse, raptors, falcons, grebes and flamingos 
have very different phylogenetic positions (see Braun et al. 2019). I will restrict the 
discussion to these few examples, which show that Sibley and Ahlquist (1990) and 
Sibley and Monroe (1990) were correct in some groupings but also very wrong in 
others. These and other discrepancies can be partly explained by the methodology of 
DNA-DNA hybridisation, which does not provide sufficient resolution at the level of 
some phylogenetic categories and was also prone to laboratory artefacts (e.g. how to 
separate single-copy DNA from repetitive DNA). Sibley and Ahlquist (1990) were 
aware of the limitations of the DNA-DNA hybridisation method, but, at that time, it 
was the best method available. 


6 The Era of DNA Sequence Analysis 

The development of rapid DNA sequencing technologies was important for most 
branches of biology. In the 1970s two sequencing methods were developed, the 
chemical sequencing by Maxam and Gilbert and the enzymatic Sanger sequencing 
(Sanger 1981; Saiki et al. 1985, 1988). At first, DNA fragments were separated by 
gel electrophoresis on high-resolution gels. At that time, DNA sequencing was 
challenging and time-consuming. The next technological jump in the 1990s was the 
development of capillary electrophoresis sequencers, for example, from ABI. Major 
improvements came from upgrades to machines with 4, 8, 24, 48 and up to 
96 capillaries, so that several sequences could be analysed at the same time. Large 
sequencing core facilities utilised multiple machines, which allowed the sequencing 
of the human genome in 2001. Next came a turning point in DNA sequencing, when a 
new generation of DNA sequencers (the pyrosequencer 454 from Roche) hit the 
market with a completely new concept: parallel sequencing. Further developments of 
next-generation sequencing (NGS) were sequencers from Illumina, SOLiD, Ion 
Torrent, PacBio and Nanopore, which enable sequencing of complete genomes and 
transcriptomes (Metzker 2010; van Dijk et al. 2014; Kraus and Wink 2015; Jax 
et al. 2018) (see also Weissensteiner and Suh 2019; Braun et al. 2019; Ottenburghs 
et al. 2019; Husby et al. 2019; Griffin et al. 2019). 

Fig. 2 The “avian tree of life” after Hackett et al. (2008) based on nucleotide sequences of 19 nuclear 
genes 

When DNA sequencing became more widely available, bird phylogenies were 
reconstructed from an analysis of the nucleotide sequences of one or more marker 
genes [in the beginning only mtDNA, later mtDNA and nuclear DNA (ncDNA) 
were used] from each species. These phylogenies provided a much higher resolution 
at the family and genus level but often failed to infer divergences in the far past 
(Kraus and Wink 2015). A breakthrough was achieved by Hackett et al. in 2008 who 
sequenced 19 nuclear genes of each of the major bird families using traditional 
capillary electrophoresis sequencers. A simplified phylogeny is shown in Fig. 2. 
Hackett et al. (2008) had already recognised important clades, which were confirmed 
later by genome sequencing via NGS (Jarvis et al. 2014; Prum et al. 2015) (see 
Braun et al. 2019; Weissensteiner and Suh 2019). These are the sister-pair relationship 
of grebes and flamingos, a common ancestry of swifts and nightjars, the 
separation of falcons from diurnal raptors, inclusion of New World vultures in the 
raptor clade and a new clade combining falcons, parrots and passerine birds (Wink 
2015). 

In comparison to other animal groups, birds entered the era of genomics rather 
late (Kraus and Wink 2015; Braun et al. 2019). The first sequenced avian genome 
was that of the chicken (Gallus gallus) (Hillier et al. 2004), followed by the genome 
of the zebra finch (Taeniopygia guttata) (Warren et al. 2010), and then the turkey 
(Meleagris gallopavo) (Dalloul et al. 2010), the pied and collared flycatcher 
(Ficedula hypoleuca, F. albicollis) (2012) and peregrine, saker falcon and mallard 
(Falco peregrinus, F. cherrug, Anas platyrhynchos) (Huang et al. 2013; Zhang 
et al. 2014). Most early genomes were from agriculturally important species 
(chicken, turkey, duck). These approaches were driven by breeding genomics or 
veterinary/biomedical applications (see Vignal and Eory 2019). 

The above-mentioned genome data were instrumental for avian phylogenetics: 
an important step forward came when real genome data of representative taxa of the 
avian tree of life were produced by NGS. The presently used Illumina technology 
allows the rapid parallel sequencing of up to 250 million short sequences (50-200 
nucleotides) in a single run; newer technologies aim for longer reads (Kraus and 
Wink 2015). 
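
To put these throughput figures into perspective, the following sketch converts reads per run into total sequenced bases and into the average depth of coverage of one bird genome. The function name `expected_coverage` is hypothetical, and the genome size of 1065 Mb is an assumption (roughly the size of the chicken assembly); the read numbers are the upper bounds quoted above, not measured run statistics.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


N_READS = 250_000_000          # upper bound of reads per run quoted in the text
GENOME_SIZE = 1_065_000_000    # assumption: about the size of the chicken assembly

for read_length in (50, 200):  # shortest and longest read length quoted in the text
    total_gb = N_READS * read_length / 1e9
    depth = expected_coverage(N_READS, read_length, GENOME_SIZE)
    print(f"{read_length} nt reads: {total_gb:.1f} Gb per run, "
          f"about {depth:.1f}-fold coverage")
```

Under these assumptions, a single run yields between roughly 12 and 50 Gb of sequence, that is, on the order of tenfold to nearly fiftyfold coverage of one avian genome.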

A first phylogenomic avian tree of life was published in 2014 by the Avian 
Phylogenetic Consortium (Jarvis et al. 2014), which was followed in 2015 by a more 
detailed genome analysis of Prum et al. (2015) based on nucleotide sequences of 
259 nuclear genes and a total of 394,000 sites, which covers 198 species in 
122 families and 40 orders, providing the most up-to-date and comprehensive 
avian phylogeny (details in Braun et al. 2019). The study of Prum et al. (2015) can be 
regarded as an extension of Hackett et al. (2008). The advantage of the multigene 
analysis of Prum et al. (2015) over Jarvis et al. (2014) is that they could include more 
taxa and that all the genes sequenced could be unequivocally aligned. Jarvis et al. 
(2014) used partial genome data, which suffer from difficulties in aligning homologous 
DNA sequences and in avoiding the inclusion of chimeric sequence assemblies and 
pseudogenes. 

Sequencing technologies will develop further. From a technological point of 
view, Illumina has won the battle for throughput, but third-generation sequencing 
such as PacBio or Nanopore sequencing is beginning to reach the laboratory. These 
new NGS technologies allow the generation of long sequence runs. The new era of 
long-read sequencing coupled with near-chromosome scaffolding is bringing huge 
improvements especially in gene completeness of assemblies (Gordon et al. 2016). 
The longer the reads and the higher the quality of the sequences, the better the 
phylogenies (see also Weissensteiner and Suh 2019; Braun et al. 2019; Ottenburghs 
et al. 2019; Husby et al. 2019; Griffin et al. 2019). 

As we presently recognise well over 10,300 bird species (some estimates assume 
even more than 18,000 bird taxa; Barrowclough et al. 2016), it will certainly take 
some time until we have the “avian tree of life”, which will reliably provide the 
phylogenetic position and history for each of the avian species. With such a tree of life, 
we will finally have a framework not only for systematics but also for studying the evolution of 
traits and adaptations in birds. 


Avian Genomics in Animal Breeding 
and the End of the Model Organism 

Alain Vignal and Lei Eory 


Abstract 

Avian genomics have long benefited from chicken being not only an important 
agricultural species but also a model organism for various fields of research 
including genetics, immunology, embryology and more recently vertebrate 
genome organisation and function. Thanks to this, chicken was in 2004 amongst 
the first vertebrates sequenced, and its genome annotation still benefits from a 
large community of users. Hundreds of quantitative trait loci (QTLs) have been 
mapped in chicken, and genome-wide information is now used in breeding 
programmes in agriculture. The chicken genome sequence was soon followed 
by the zebra finch sequencing, a model for the biology of learned vocalisation. 
Shortly after, the advent of new sequencing technologies, reducing cost and 
increasing speed of sequencing, allowed the turkey and duck genomes to follow. 
Now, close to 50 bird genomes are published, and with the ever-decreasing cost 
of sequencing and most recent genome assembly techniques, many more are 
expected soon. In the new era of genomics technologies, will model organisms 
remain essential? 


1 Introduction: Chicken as a Model Organism for Birds 

The domestication of chicken is thought to have taken place between 6000 and 
10,000 years ago in South East Asia (Tixier-Boichard et al. 2011) and possibly also 
in China (Xiang et al. 2014, 2015). Nowadays, chickens are raised worldwide for 
meat and egg production and are amongst the main sources of protein for human 
nutrition. Domestic animal species are tame and readily available, making them an 
easy source of study, with scientific observation using chicken as a model going as 
far back as Aristotle’s description of the embryo (Lennox 2017). Since then, chicken 
has been a model organism for the study of many other biological questions and is 
still essential in cancer, cardiovascular and embryology research (Burt 2007; Kain 
et al. 2014). As a consequence, the species has become one of the major animal 
species and certainly the most prominent bird studied in the genomics era (Schmutz 
and Grimwood 2004). 

Breeders have long worked towards selecting desirable quantitative traits such 
as growth, feed efficiency, carcass composition or the number and quality of eggs 
laid. Meanwhile, interests have also focused on qualitative variations in traits such as 
plumage, skin or eggshell colours, now the hallmark of many local breeds. All these 
quantitative and qualitative genetic variations are conserved in commercial or 
experimental populations and in fancy breeds and are now broadly studied both to 
improve phenotype-genotype interactions and breeding practices and to understand 
basic biology (Delany 2004; Tixier-Boichard 2007). Examples of chicken phenotypic 
variation are shown in Fig. 1. Concerning basic research, a number of 
discoveries stemming from using chicken as a model organism include the discovery 
of oncogenic viruses through the transmission of cancer by way of a cell-free filtrate 
from a sick to a healthy chicken (Rous 1911) or the delineation of the thymic and 
bursal lymphoid systems in 1965 (Cooper et al. 1965). But embryology remains the 
main domain in which chicken is a prominent model organism, thanks to easy 
access to the embryos in the eggs and to a precise description of its developmental 
stages that goes back to 1951 (Hamburger and Hamilton 1992). Paving the way towards 
the era of whole genome sequencing, the construction of molecular genetic maps 
started in the early 1990s (Bumstead and Palyga 1992; Levin et al. 1994), thus 
establishing chicken as a reference for other studies in birds. Progress in molecular 
genetics in the second part of the twentieth century has been extremely rapid, from 
the identification of DNA as the physical support for genetic information (Avery 
1944) to the first publication of the human genome (Lander et al. 2001; Venter et al. 
2001). Since then, a lot more has been done to understand the function of genomes, 
such as exemplified in the ENCODE project (ENCODE Project Consortium et al. 
2012). Farm animals were amongst the very few vertebrates to follow the path of 
human genomics, together with model organisms such as mouse, rat or zebrafish. 
Amongst these, chicken was the first to have a whole genome sequence published, as 
early as 2004, thanks to its importance as a model organism and as the most 
studied representative of birds, in addition to its economic importance. 

Fig. 1 A sample of selected phenotypic variation in chicken 

The first draft vertebrate genome assemblies produced in the 2000s have always 
been the fruit of coordinated efforts organised by large consortiums of research 
groups collaborating with one or several high-capacity sequencing centres (Mullikin 
and McMurray 1999). In the case of the human genome project, sequencing centres 
from many countries were involved, each having dozens of semiautomated 
Sanger sequencing machines and all the associated laboratory equipment (Lander 
et al. 2001). A totally different picture can be seen nowadays, with the so-called NGS 
(next-generation sequencing) technologies allowing the production of very large 
datasets at low cost (Shendure et al. 2017). As a consequence, bird species other than 
chicken have had their genomes sequenced, either because of their interest in basic 
research, such as the quail in embryology and the study of neural crest cell migration 
(Le Douarin and Teillet 1974) or the zebra finch for the neurobiological basis of 
vocal learning (Mello 2014), or due to their interest in human nutrition, such as the 
common duck or the turkey (Dalloul et al. 2010; Huang et al. 2013). With increasing 
sequencing capacity and decreasing costs, the number of bird species having their 
genome sequenced is steadily increasing (Zhang et al. 2014b; Kraus and Wink 
2015). Having a whole genome sequence assembly to use as a reference is now 
considered a prerequisite for any work involving genomics, but how much will this 
new era of cheap genome sequencing affect the position of chicken as the prominent 
bird model species? 

To answer this question, we will explore the long and winding path towards the 
present state of knowledge in chicken, starting with the observation of chromosomes 
and the first molecular genetic maps and having now led to a deep understanding of 
the genome through sequencing and annotation, both structural and functional. We 
will also show a few examples of biological questions of interest to breeders and to 
molecular ecology and evolution, whose study has benefited from this knowledge. 


2 The Dawn of Chicken Genomics: From Maps to Genome 
Sequence and Annotation 

For technical descriptions of maps and genome assembly structure, we offer the 
readers Fig. 2 and legend. 


2.1 Genomics for Improving Breeding Practices 

Interestingly, the first coordinated efforts towards genome-wide studies in chicken 
were targeted at developing genetic maps for breeding purposes. The intention 
was to understand the mechanisms underlying the heritable portion of trait 
variability, to discover genetic markers linked to traits of interest and thereafter to 
use this information for genetic improvement through marker-assisted selection 
(MAS). MAS is especially interesting for traits that are too complex or even 
impossible to measure, such as resistance to disease. The most common approach 
to find markers for MAS was to screen specifically designed second-generation “F2” 
or backcross generation “BC” populations to detect quantitative trait loci (QTLs) 
with markers uniformly distributed along the genome. To obtain such collections of 
markers, the development of genetic maps was required, and these first maps later 
proved handy as a backbone for the whole genome sequence assembly. Indeed, when 
relying solely on sequence data, many sequenced species do 
not have assemblies presented at the chromosome level; instead, their assemblies take the form of 
a collection of continuous genomic DNA sequences named contigs, themselves 
assembled into discontinuous scaffolds (Fig. 2). These scaffolds are not assigned 
to chromosomes and are not even ordered relative to one another. For instance, in 
the absence of dense genetic maps or other types of mapping information, the current 
common duck genome assembly is composed of 227,597 contigs assembled into 
78,487 scaffolds. The longest contig and scaffold are 263 kb and 5.9 Mb long, 
respectively (Huang et al. 2013), which is much smaller than the size of chromosome 
arms, estimated to range between 10 Mb and more than 100 Mb (Pichugin et al. 
2001). None of the chromosomal assignment positions of these contigs and scaffolds 
are known, for lack of other substantial mapping information. In the case of chicken, 
the genome assembly is presented as chromosome-level sequences, thanks to the 
availability of a wealth of mapping data in the form of cytogenetic, genetic, radiation 
hybrid (RH) and bacterial artificial chromosome (BAC) contig maps, most of which 
were produced in the pre-genome sequence era to help identify genes of 
interest for the breeding industry. Chicken also benefits from extensive data produced 
on gene transcripts, essential for the annotation of the genome and the 
construction of gene models (Yandell and Ence 2012). 


2.2 The Definition of the Karyotype and Cytogenetic Maps 

The definition of the karyotype, a description of the number and appearance of 
chromosomes for a given species, is a prerequisite for having any mapping or 
sequence data assigned properly to a chromosome location. Karyotypes are 
established by cytogenetic observation of chromosomes when condensed at the 
metaphase stage of cell division, and once the standard karyotype is defined by 
classical cytogenetics, fluorescent in situ hybridisation (FISH) will allow the 
assignment of DNA fragments to chromosomal regions. Without this, genome 
entities such as genetic markers, genes or genome sequence portions cannot have 
chromosomal correspondence, let alone coordinates. 

Fig. 2 Chicken karyotype, genome maps and sequence assembly. (a) Chicken karyotype. The chicken karyotype is 
typical of bird genomes, having a few macrochromosomes and a larger number of 
microchromosomes. The female is the heterogametic sex, having Z and W chromosomes. 
Microchromosomes cause specific problems for mapping and sequencing, mainly due to their 
small size and high (G + C) content. Karyotype courtesy of V. Fillon 
(GenPhySE, INRA Toulouse, France). (b) Maps, sequence and structural annotation. The chicken genome assembly is the result of coordinated 
efforts of a consortium of collaborating laboratories. Whole genome sequencing strategies do 
not allow the direct production of an ideally continuous sequence covering each chromosome arm. 
The combination of sequence reads, together with short-distance read-pairing information, allows 
the production of stretches of continuous sequence (contigs) and the assembly of these contigs 
into discontinuous blocks of sequence (scaffolds). Gaps in scaffold sequence are reported by 
stretches of “N”, whose length corresponds to the estimated size of the gaps. Missing sequence 
can be due to many factors including local sequence complexity and base composition. This 
discontinuous nature of the sequence assembly has an important impact on the detection of 
genes, whose parts (exons, splice sites, 5' or 3' UTR, etc.) may be missing, introducing 
approximations and errors in the genome annotation. Depending on various parameters including 
depth of coverage and the sequencing technology used, a genome assembly may be composed of 
thousands of scaffolds. For the assembly of the chicken genome scaffolds into chromosome-level 
sequences, mapping information from genetic, physical and radiation hybrid maps was used. FISH 
mapping allows the assignment of the sequence to the chromosomes. However, some sequence 
scaffolds may not be mapped to chromosomes, in which case they are in a category called the 
unknown sequence. In the chicken assembly, some of the unknown sequence may correspond to the 
not yet identified microchromosomes. Genetic and radiation hybrid mapping is currently used to 
group this unknown sequence into linkage groups. The more recent sequencing technologies, such 
as PacBio, produce long reads and consequently longer contigs. The mapping technology now 
used most widely for assembling the scaffolds into chromosomes is optical mapping. This combination 
allows for higher quality genome assemblies at a much lower cost than previously 
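
The relationship between scaffolds, contigs and gaps described in the legend to Fig. 2 can be made concrete with a small sketch. It assumes only the convention stated there, namely that gaps inside a scaffold are written as runs of “N”; the function name `split_scaffold` and the toy sequence are hypothetical.

```python
import re


def split_scaffold(scaffold):
    """Return (contigs, gap_lengths) for one scaffold sequence.

    Contigs are the runs of called bases; gaps are the runs of 'N'
    that separate them inside the scaffold.
    """
    seq = scaffold.upper()
    contigs = [c for c in re.split(r"N+", seq) if c]
    # Leading or trailing N runs do not separate two contigs, so drop them
    # before measuring the internal gaps.
    gaps = [len(g) for g in re.findall(r"N+", seq.strip("N"))]
    return contigs, gaps


# Toy scaffold: three contigs separated by gaps of 5 and 3 unknown bases.
toy = "ACGTACGGTA" + "N" * 5 + "TTGACCA" + "N" * 3 + "GGCATT"
contigs, gaps = split_scaffold(toy)
print(len(contigs), "contigs with lengths", [len(c) for c in contigs])
print("estimated gap sizes:", gaps)
```

Applied to every scaffold of a real assembly, the same splitting would recover contig and scaffold counts of the kind quoted for the duck genome, although actual assemblies are normally read from FASTA files with dedicated tools.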

Classical cytogenetic studies were performed for hundreds of bird species, 
revealing their very particular karyotype structure and suggesting relationships 
between species by the observation of chromosome rearrangements (Christidis 
1990). The chicken karyotype is typical of the vast majority of birds, with a few 
large chromosomes called macrochromosomes, a higher number of small 
microchromosomes and the Z and W gonosomes, the female being the heterogametic 
sex (Schoffner and Kristan 1965). The microchromosomes are very difficult to 
identify, due to their appearance as small spots on metaphase preparations, and 
amongst the 38 pairs of autosomes composing the chicken karyotype, only the 8-10 
largest ones can be identified by classical cytogenetic methods and are usually 
referred to as the macrochromosomes (Ladjali et al. 1995). To identify each 
microchromosome individually, FISH was used with large insert-containing clones, 
usually BACs. Simultaneous hybridisation of groups of clones by two-colour 
FISH allowed the identification of a first series of 16 microchromosomes (Fillon 
et al. 1998), and the definition of the complete karyotype was done by using a 
combination of large insert-containing clones and chromosome paints obtained 
by microdissection (Masabanda et al. 2004). However, only large-insert clones 
allow linking maps and the genome assembly to the karyotype. As a consequence, 
microchromosomes still lacking a clone-based identification do not appear in the 
assembly, although some of their sequence may be present in the unknown fraction 
(see legend to Fig. 2) and in new unassigned genetic map linkage groups (Warren 
et al. 2017). 


2.3 Genetic Maps and the Initial Mapping of Phenotypic Traits 

Genetic mapping is the oldest technique for ordering loci along chromosomes, as it 
works by following the transmission in pedigrees of the different alleles segregating 
at each locus. In chicken, these were at first very simple and directly observable 
phenotypes such as shank colouration or feather shape (Bitgood 1993). Phenotypes 
expressed by alleles whose underlying genes are at loci close together on a chromosome 
will tend to be transmitted together in successive generations, an indication of 
genetic linkage, for instance, dominant white and frizzle plumage (Hutt 1933). The 
distance between two loci, a function of the frequency of crossing-overs at meiosis, 
is estimated statistically by calculating the frequency at which phenotypes are 
transmitted together from parents to offspring. A set of loci linked together is a 
linkage group, and in theory one linkage group per chromosome should be obtained. 
However, if the density of markers is too low, some chromosomes may not be 
represented at all, and two linkage groups belonging to the same chromosome may 
appear as segregating independently. The numerous mutations affecting easily 
observed phenotypes such as plumage colour, comb shape, feather shape, skin 
colour or skeleton size, and considered as essential characteristics in the description 
of breeds, were used to build genetic maps as early as 1933 (Hutt 1933). These were 
followed by a collaborative effort to produce what was later called the classical 
genetic map (Bitgood 1993). However, due to low marker density, these maps were 
quite crude, with only a few chromosomes represented, each of them only partially, 
and molecular markers were thereafter used to solve this problem. 
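
The statistical estimation of genetic distance described above can be sketched as follows. The offspring counts are hypothetical, and the conversion of the recombination fraction into centimorgans uses the Haldane map function, which is an assumption made here for illustration (it presumes no crossover interference) and is not named in the text.

```python
import math


def recombination_fraction(n_parental, n_recombinant):
    """Share of offspring in which the two loci were not transmitted together."""
    return n_recombinant / (n_parental + n_recombinant)


def haldane_cm(r):
    """Haldane map function: distance in centimorgans, d = -50 * ln(1 - 2r)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5); r = 0.5 means no detectable linkage")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical counts: 90 offspring carry a parental combination of the two
# phenotypes, 10 carry a recombinant combination.
r = recombination_fraction(n_parental=90, n_recombinant=10)
print(f"recombination fraction r = {r:.2f}")
print(f"map distance = {haldane_cm(r):.1f} cM")
```

As r approaches 0.5, the two loci segregate independently and the estimated distance grows without bound, which is why two sparsely marked parts of one chromosome can appear as separate linkage groups.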

Restriction fragment length polymorphisms (RFLP) were the first generation of 
molecular or DNA-based genetic markers that could be used on a large scale and 
were successfully used for the construction of a first genetic map in human (Donis- 
Keller et al. 1987). The long-term goal was to map genetic diseases and possibly 
identify the genes involved in their aetiology. Following this path, a similar approach 
was thought possible for mapping QTL governing the expression of agronomic traits 
in farm animals, including chicken (Soller and Beckmann 1986). The idea was that if 
one or several genetic markers positioned close enough to the DNA polymorphism 
underlying the phenotypic variation could be identified by linkage analysis, these 
could thereafter be used to develop a genetic test for the selection of favourable traits. 
Of special interest are traits that are difficult to measure and therefore not used in 
classical selection, such as genetic resistance to diseases such as coccidiosis or 
Marek’s disease, or to colonisation by Salmonella. The first molecular genetic map 
in chicken composed of RFLP markers was published in 1992 (Bumstead and 
Palyga 1992), but by that time microsatellites, a new generation of genetic markers, 
were shown to be far easier to use by PCR and to have a much higher number of 
alleles, allowing for higher mapping precision (Litt and Luty 1989; Weber and May 
1989). High-density genetic maps are required, and a chicken consensus genetic map 
containing 1889 markers was produced through a worldwide collaborative effort 
(Groenen et al. 2000). This map proved very useful for the initial mapping of 
Mendelian traits such as naked neck or polydactyly (Pitel et al. 2000) or QTL such 
as meat quality (Nadaf et al. 2007) or resistance to Salmonella (Mariani et al. 2001). 
With the advent of technologies allowing for the genotyping of single nucleotide 
polymorphism (SNP) in high numbers and at a low cost, the chicken genetic map 
was greatly improved to a map containing 9268 markers (Groenen et al. 2009), and 
now a chicken genotyping chip containing over 580,000 SNPs is available (Kranis 
et al. 2013). With such high marker densities, the mapping of traits no longer relies 
on linkage mapping in pedigreed populations but on genome-wide association 
studies (GWAS) in populations. This approach, by taking the historical 
crossing-overs having occurred in populations into account, allows for higher 
precision in mapping, as, for instance, in Rome et al. (2015). 
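
The gain in resolution across these marker generations can be illustrated by the average spacing between markers. The sketch below assumes a genome size of 1065 Mb (roughly the size of the chicken assembly) and an even distribution of markers, which real maps only approximate; the function name `mean_spacing_kb` is hypothetical.

```python
GENOME_SIZE_BP = 1_065_000_000  # assumption: about the size of the chicken assembly

# Marker counts as quoted in the text; the SNP chip figure is a lower bound.
MARKER_SETS = {
    "consensus map (Groenen et al. 2000)": 1889,
    "SNP map (Groenen et al. 2009)": 9268,
    "SNP chip (Kranis et al. 2013)": 580_000,
}


def mean_spacing_kb(n_markers, genome_size_bp=GENOME_SIZE_BP):
    """Average distance between neighbouring markers in kilobases,
    assuming the markers are spread evenly over the genome."""
    return genome_size_bp / n_markers / 1000


for label, n_markers in MARKER_SETS.items():
    print(f"{label}: one marker every {mean_spacing_kb(n_markers):.1f} kb on average")
```

Under these assumptions the spacing falls from several hundred kilobases for the consensus map to about 2 kb for the SNP chip.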

Although genetic maps were made in the first instance for mapping QTL, they later 
proved essential for assembling the chicken genome sequence. As a few small 
microchromosomes are still absent from the genome sequence assembly, efforts 
towards their inclusion involve different approaches including linking together 
genomic sequence scaffolds currently non-assigned to chromosomes by means of 
additional genetic markers and linkage groups, in addition to performing long-read 
sequencing (Warren et al. 2017). 


2.4 Physical Maps: A Step Towards Positional Cloning 

MAS requires genetic markers positioned as close as possible to the DNA polymorphism 
underlying the phenotypic variation of interest, and ideally the causative 
polymorphism itself should be used, whose identification could only be done by 
the positional cloning approach. When using the first-generation microsatellite 
genetic maps, QTL mapping had a precision in the order of several hundred 
kilobases (kb), intervals typically containing dozens of genes. The only sequence 
information then available was restricted to the few hundred base pairs (bp) that were 
sequenced to design the PCR primers for the markers, and new sequence information 
had to be developed in the mapping interval to develop additional markers. The best 
option for this was then to use large insert-containing DNA clones identified as being 
in the interval of interest. Several efforts were undertaken, first to construct BAC 
libraries (Crooijmans et al. 2000; Romanov et al. 2003) and thereafter to order them 
into a physical map (Ren et al. 2003; Wallis et al. 2004). As for the genetic map, 
beyond their primary purpose for positional cloning, physical maps also later proved 
essential as backbones for the whole genome assembly and later on for sequencing 
regions missed by the shotgun approach. Such regions were chosen because they 
contained badly assembled sequence, large gaps in the genome assembly or genes of 
particular interest that were missed for various reasons such as high repeat content. 


2.5 Expression Maps, Comparative Maps and Low Chromosomal 
Rearrangement Rates in the Bird Lineages 

The percentage of vertebrate genomes coding for proteins is very small. For instance, 
the annotated exons of all genes cover only 2.94% of the human genome and the 
protein-coding exons as little as 1.2% (ENCODE Project Consortium et al. 2012). 
With the limited sequencing capacities available in the 1990s, a shortcut to the vital 
information on coding gene information was to sequence spliced transcripts from 
messenger RNA extracted from diverse tissues. The first large-scale project for such 
EST (expressed sequence tag) production sequenced 64 cDNA libraries from 
21 adult and embryonic tissues (Boardman et al. 2002). The development of genetic 
markers within these ESTs representing gene sequences allowed mapping them on 
chromosomes. These EST/transcript maps provided candidate genes in QTL 
intervals and allowed the first comparison of bird and mammalian 
genomes at the chromosome scale using orthologous genes as anchor points, 
suggesting different dynamics of chromosome evolution between the mammalian 
and avian lineages (Burt et al. 1999). The sequence of gene transcripts is also 
essential for the annotation of the genome assembly, by detecting exons. 


2.6 The First Chicken Genome Assembly 

The chicken genome was amongst the first large vertebrate genomes sequenced (Hillier et al. 
2004), following very soon after those of human (Lander et al. 2001; Venter et al. 2001), 
mouse (Consortium et al. 2002) and fugu fish (Aparicio et al. 2002). It was 
contemporary with the rat genome sequence release (Gibbs et al. 2004). Takifugu 
rubripes is a special case, as its genome was sequenced because of its very small size 
(400 Mb), in order to have a representative vertebrate genome at low cost, and 
chicken rounded out this first broad picture of vertebrate genomes. Chicken 
was sequenced early partly because of its importance as an agricultural 
and biological model and because genome maps and other resources were 
available to aid the assembly, but mainly because of its phylogenetic position. 
Indeed, the annotation of the human genome, including the detection of coding 
genes, relied on many approaches, including whole genome comparisons of 
phylogenetically close and distant species for the detection of conserved sequences. 
For this approach, having a bird whole genome sequence was essential in addition to 
the few mammalian and fish genomes available or being sequenced in the early 
2000s (Margulies 2003). The size of the current GRCg6a chicken assembly is 
1065 Mb, which is much smaller than typical mammalian genomes, whose 
sizes are in the order of 2500-3000 Mb. 

The chicken sequence revealed many interesting features, many of which pertain 
to differences between macrochromosomes and microchromosomes. Indeed, gene 
density and CpG content were found to be higher in the latter, as had already been 
suggested by earlier results (McQueen et al. 1996, 1998). Alignment of the genetic 
map to the sequence showed higher recombination rates for microchromosomes 
(Hillier et al. 2004; Groenen et al. 2009), and comparison to very partial genomic 
sequences from another bird indicated higher rates of nucleotide divergence 
(Axelsson et al. 2005). The very low rate of interchromosomal rearrangements in 
the chicken lineage was confirmed by comparing the chicken genome with that of 
human, mouse and rat (Bourque et al. 2005), consistent with the observation that most bird 
karyotypes are very similar. Other interesting features of vertebrate genomes could be 
investigated by analysing the chicken genome. For instance, a comparison of the 
human, mouse and chicken genomes showed that some gene deserts have an unexpected 
stability, resisting chromosomal rearrangements and thus suggesting the 
existence of distant regulatory elements that need to be physically linked to their 
neighbouring genes (Ovcharenko 2005). However, despite its undeniable qualities, 
this first assembly was incomplete, with 10 of the smallest microchromosomes 
missing altogether, out of a total of 38 autosomes. The Z chromosome was poorly 
assembled due to the half coverage obtained through sequencing a ZW female 
individual, and only a little of the W chromosome was present, for the same reason 
and also because of its high repeat content. The sequence of the Z and W chromosomes was 
completed specifically by targeted sequencing of BAC and fosmid clones, based on 
maps (Bellott et al. 2010, 2017). Progress towards sequencing the missing 
microchromosomes has since been very slow, requiring the construction of specific 
radiation hybrid or genetic maps, and is still work in progress (Douaud et al. 
2008; Warren et al. 2017). At the time of writing, six microchromosomes (GGA29 
and GGA34-38) are still missing from the GRCg6a assembly. 


2.7 The Missing Microchromosomes and Towards a Complete 
Genome 

Obviously, one major reason for the absence of the smallest microchromosomes in 
the genome assembly is their very small size, with an estimated size of the smallest 
microchromosome of 3.4 ± 0.26 Mb (Pichugin et al. 2001). This makes their 
identification by cytogenetic analysis extremely difficult. However, other reasons 
also contribute. One of them is their extremely high (G + C) content, which was 
first suspected by classical cytogenetic studies (Auer et al. 1987). Regions with 
abnormally high (G + C) contents tend to resist cloning and sequencing, and each 
time a new technology comes into use, one important point to consider is its ability to 
improve the quality of the sequencing of (G + C)-rich regions. The 
chicken protamine gene is an extreme example. It was sequenced in 1989 by 
classical manual Sanger sequencing and shown to have a (G + C) content as high 
as 88% in the coding region (Oliva and Dixon 1989). Consequently, this sequence 
could not be found in the earlier genome assemblies and only appears in the latest Galgal5 
and GRCg6a versions of the genome, which were improved by using the Pacific 
Biosciences (PacBio) long-read sequencing technology, which has an improved tolerance 
for extreme (G + C) contents (Shendure et al. 2017). This improvement in 
coverage of (G + C)-rich regions can be observed as sequencing technologies 
progress. The (G + C) content of the smallest microchromosomes in the first genome 
assembly reached 50% (Hillier et al. 2004), whereas in the Galgal5 assembly, 
chromosomes 30 and 23 have (G + C) contents higher than 60% (Warren et al. 
2017). Other specific features can also make some microchromosomes difficult to 
sequence. FISH mapping showed that some of them can 
have a very high repeat content. Indeed, single cosmid or BAC clones sometimes 
give a very strong painting signal on one or several chromosomes at a time (Fillon 
et al. 1998). The repeat content in avian genomes is discussed by Weissensteiner 
et al. (2019). Chromosome 16 is a well-known example 
of a microchromosome presenting specific challenges for sequencing. It contains the 
two major histocompatibility complexes MHC-B and MHC-Y (Fillon et al. 1996) 
and for that reason has been studied extensively. However, these two important gene 
complexes represent only a small part of the chromosome, the rest being occupied by 
the 18S, 5.8S and 28S ribosomal RNA genes, repeated about 150 times, by PO41 
repeats and by olfactory receptor and scavenger receptor gene families (Miller and 
Taylor 2016). Finally, chromosome 31 contains the chicken Ig-like receptor (CHIR) 
complex, a large gene family of more than 100 genes that have evolved by recent 
duplications (Laun et al. 2006), therefore causing specific problems for assembly. 
The specific comparison of sequences from the reference assembly to published 
CHIR gene sequences obtained elsewhere (Laun et al. 2006) allowed their inclusion, 
and as a consequence the inclusion of chromosome 31, in Galgal5 (Warren et al. 
2017). 


Table 1 Gallus gallus genome assembly statistics 

Statistic                  Galgal2.1   Galgal4    Galgal5     GRCg6a
Nb. autosomes              30          30         32          
Length of assembly (Mb)    1098        1046       1230        1065
Contig-N50 (kb)            46.345      279.750    2894.815    17655.422
Scaffold-N50 (Mb)          11.063      12.877     6.379       20.7
Scaffold-N75 (Mb)          4.432       5.752      1.611       12.7
Scaffold-N90 (Mb)          0.971       2.116      0.014       6.1
Scaffold-L50               26          23         47          12
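
To make the (G + C) figures quoted in this section more concrete, the following minimal Python sketch computes the (G + C) content of a sequence, overall and in consecutive windows. The function names and the toy sequence are illustrative assumptions and are not taken from any assembly pipeline; ambiguous bases such as N are simply ignored.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # A sequence made only of ambiguous bases has no defined GC content; return 0.0.
    return gc / total if total else 0.0


def windowed_gc(seq, window, step):
    """List of (start, GC fraction) for each full window along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence (hypothetical): a GC-rich stretch followed by an AT-rich one.
toy = "GCGCGGCCGC" + "ATATTAGCAT"
print(round(gc_content(toy), 2))
print(windowed_gc(toy, window=10, step=10))
```

Applied window by window along a chromosome, a profile of this kind shows where (G + C)-rich stretches lie, which is the type of region described above as resisting cloning and sequencing.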


2.8 Statistics on the Chicken Genome Assemblies 

Statistics on the four major chicken genome assemblies are indicated in Table 1. 
Galgal3, sometimes also referred to as Galgal2.1, was generated mainly by Sanger 
sequencing, with the addition of some targeted sequencing of BAC and fosmid 
clones. Galgal4 saw the addition of first-generation NGS data to the Sanger assembly 
in the form of Roche 454 sequences. Galgal5 is a new assembly starting with 50× coverage of 
PacBio sequence, corrected by 36× Illumina paired-end reads; assembled with the 
help of BAC-end sequences and of improved physical, genetic and RH maps; and 
completed with finished BAC clones, chosen specifically to close gaps (Warren et al. 
2017). This resulted in the addition of two microchromosomes and of 183 Mb of 
sequence to the assembly, which thus expanded to a total length of 1.21 Gb. 
Another major improvement is the contiguity of the sequence, with an increase of 
the contig N50 from 279 to 2894 kb, meaning that the number of gaps in the 
sequence is much lower. The latest assembly, GRCg6a, is built from 82× 
coverage of Pacific Biosciences reads and shows a dramatic increase of both contig 
and scaffold N50, together with a shorter overall assembly length (Table 1). 
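
As a reminder of what the N50 and L50 statistics in Table 1 measure, here is a minimal Python sketch. It assumes the usual definition: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly length, L50 being the number of sequences needed to get there. The function name and the contig lengths are hypothetical and serve only as an illustration.

```python
def nx_and_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    With fraction=0.5 this gives (N50, L50); fraction=0.9 gives (N90, L90).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count


# Hypothetical contig lengths in kb, for illustration only.
contigs = [80, 70, 50, 40, 30, 20]
n50, l50 = nx_and_lx(contigs)
n90, l90 = nx_and_lx(contigs, fraction=0.9)
print(n50, l50, n90, l90)

# Closing the gap between the two longest contigs merges them into one.
merged = [80 + 70, 50, 40, 30, 20]
print(nx_and_lx(merged))
```

In this toy case, closing a single gap raises the N50 from 70 to 150 kb while the total length is unchanged, which is why a higher contig N50 is read as a sign of fewer gaps.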

Finally, since the release of Galgal5, chicken has been included in the Genome Reference 
Consortium (GRC) (Church et al. 2011), together with the genomes of human, 
mouse and zebrafish, which are the only other species included at the time of writing. 
The GRC allows community input and continued curation of reference 
genome assemblies and provides access to patches, which are accessioned 
scaffold sequences that represent assembly updates. These patches add information 
to the assembly without disrupting the chromosome coordinates, allowing for 
early access to new information. Sequencing information providing alternate 
representations of haplotypes for multi-allelic loci can also be included and will 
especially be important for regions such as the MHC or the CHIR complexes. 
Interestingly, long-read sequencing technology allows the production of phased 
sequences and haplotyping (Korlach et al. 2017). Being able to represent 
sequence insertions, deletions and inversions in the genome reference will also 
increase its usefulness, as such events will be detected more easily in sequenced 
populations. 


2.9 Understanding the Genome: The Annotation 

Once a genome is sequenced, a lot still has to be done to make sense out of this raw 
information. Eric Lander, one of the leaders of the human genome project in 2003, 
summarised this quite clearly: “Genome: Bought the book; hard to read”. He 
also noted that “The only problem is: there’s no index” 
(http://www.improbable.com/airchives/paperair/volume9/v9i6/nano/nano_6.html). At that 
time, one of the main problems was simply to 
estimate correctly the number of protein-coding genes contained in the 
genome, a problem still not entirely resolved (Pertea et al. 2018)! Since then, efforts 
have been devoted towards understanding many other aspects of genome biology, 
such as finding target sequences for regulatory elements, studying histone marks and 
chromatin accessibility. Much of this work was done on a genome-wide basis within 
the frame of the Encyclopedia of DNA Elements (ENCODE) project (ENCODE 
Project Consortium et al. 2012). Structural annotation is the task of finding the 
functional elements in the genome, whose implications in the biology of the organism 
will be further specified by functional annotation, either by direct experimental 
evidence or by similarity to orthologous genes or sequences in other organisms. This 
annotation concerns first of all the genes coding for proteins but also 
the non-coding genes, mainly microRNAs and long non-coding RNAs, and finally 
all the regulatory elements such as promoters, enhancers or chromatin domains, all 
of which have potential roles in the genetic part of phenotypic variation via DNA 
polymorphism. The most obvious approach to finding transcribed genes is to align 
messenger RNA sequences, or even better protein sequences if available in the case 
of coding genes, on the genome assembly (Yandell and Ence 2012). Numerous 
efforts have been made since the first large-scale EST production in chicken (Boardman et al. 
2002). The three main problems when producing transcribed sequence 
data for structural annotation are (1) to ensure that a wide variety of tissues are 
sampled, so as to obtain all possible tissue-specific genes, (2) to normalise or to 
sequence at a high enough depth for detecting genes expressed at a low level and 
(3) to produce long sequence reads so as to encompass complete genes and to detect 
all possible alternative splicing events. Large efforts are still underway through 
transcript sequencing, to improve the gene annotation in chicken (Zhang et al. 
2017; Muret et al. 2017; Kuo et al. 2017). Much of the genetic variation underlying 
quantitative traits is likely to be located in regulatory sequences, and epigenetic 
mechanisms are also likely to play an important role in phenotypic expression. To go 
further towards characterising the functional elements of the genome, an internationally 
coordinated project, Functional Annotation of Animal Genomes (FAANG), 
aims at going deeper into RNA sequencing, studying histone marks and chromatin 
accessibility (Andersson et al. 2015). This project works in a similar fashion to the 
ENCODE project but on several farm animal species. Finally, another approach 
towards detecting functional elements in the genome is to detect constrained 
sequences, which are conserved across evolution, suggesting their importance. 
This approach was used in mammalian genome biology, first for the detection of 
exons and for counting human genes (Roest Crollius et al. 2000) and later for 
identifying other functionally constrained elements of the genome (Davydov et al. 
2010). With the availability of numerous bird genomes spanning the avian phylogenetic 
tree, a similar approach can now be taken to detect elements conserved between 
birds and mammals and also elements specific to birds. 


3 The Breeder's Perspective on the Genome 

3.1 Marker-Assisted Selection, Genomic Selection and QTL 
Identification 

The primary idea behind studying the chicken genome from a breeder’s perspective 
was to provide tools in the form of DNA markers, to be used for MAS and now more 
generally for genomic selection (GS). Marker-assisted selection was the first 
use of molecular genetic markers, envisaged right at the beginning of molecular 
genetics, and was the reason for the development of the first genetic maps in the 
early 1990s. At the time, little was known about the molecular makeup of genomes, and 
genotyping capacity was limited. Genetic markers identified through linkage to 
QTLs were to be used subsequently for selecting favourable alleles of practical 
interest in the breeding populations. Now, except for very specific cases in which a 
phenotype cannot be measured routinely, such as disease resistance, MAS has been 
replaced in the breeding industry by GS, thanks to the high-density SNP chips 
containing over 580,000 markers in chicken and to affordable genotyping methods. 
For GS, all that is required in terms of biological information for the markers is that 
they cover the genome in such a way that the quantitative trait loci (QTLs) to be 
selected for are each in linkage disequilibrium with at least one of them (Goddard and 
Hayes 2007). In GS, phenotypes still have to be measured, but not on such a large 
scale as previously. Briefly, the association of genotypes and phenotypes is 
established in a training population, and for other individuals the phenotypes are predicted 
using only the genotyping information. Using predicted phenotypes also allows 
the generation time to be reduced by selecting the best individuals at a young age, 
which can be especially interesting in the case of phenotypes measured late in life or 
estimated through the measure of offspring, for instance, selecting roosters on the 
basis of predicted egg production parameters of their offspring. 
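
The two-step logic of GS described above (estimate marker effects in a training population, then predict phenotypes from genotypes alone) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses ridge regression on marker genotypes coded as 0, 1 or 2 copies of an allele, which is one common way of estimating marker effects and not necessarily the method used in any given breeding programme. The function names, the ridge parameter and the toy data are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_marker_effects(genotypes, phenotypes, ridge=1.0):
    """Estimate marker effects in the training population (ridge regression)."""
    n, p = len(genotypes), len(genotypes[0])
    mean_y = sum(phenotypes) / n
    col_means = [sum(g[j] for g in genotypes) / n for j in range(p)]
    z = [[g[j] - col_means[j] for j in range(p)] for g in genotypes]  # centred genotypes
    ztz = [[sum(z[i][a] * z[i][b] for i in range(n)) + (ridge if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    zty = [sum(z[i][a] * (phenotypes[i] - mean_y) for i in range(n)) for a in range(p)]
    return mean_y, col_means, solve(ztz, zty)


def predict(model, genotype):
    """Predict the phenotype of a genotyped but unphenotyped individual."""
    mean_y, col_means, effects = model
    return mean_y + sum(e * (g - m) for e, g, m in zip(effects, genotype, col_means))


# Toy training population (hypothetical): marker 0 adds 2 units per allele copy,
# marker 1 has no effect.
train_genotypes = [[0, 1], [1, 0], [2, 1], [0, 2], [1, 1], [2, 0]]
train_phenotypes = [10 + 2 * g[0] for g in train_genotypes]

model = fit_marker_effects(train_genotypes, train_phenotypes, ridge=1.0)
print(predict(model, [2, 0]), predict(model, [0, 0]))
```

With a ridge parameter greater than zero the estimated effects are shrunk towards zero, which is what keeps the system solvable when markers are far more numerous than phenotyped animals; candidates are then ranked on their predicted values.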


3.2 Identification of Genomic Events Causing Variation 
in Ornamental and Colouration Genes 

Since the domestication of animals, humans have often singled out unusual ornamental 
traits, which they have selected either for pleasure or because they were associated with 
production traits. In recent centuries, these ornamental traits have often been kept 
within specific breeds and are considered as important characteristics described in 
the breed standards. These can be quite precise, with descriptions including, for 
instance, the shape and colour of the body, the head, the tail, the legs, the crest, the 
wattles, etc. The description of the plumage is also important, as feathers can be of 
varying length and be striped or frizzled and as body coverage can vary, with some 
breeds having naked heads and necks or varying amounts of feathers on the legs. The 
colour of the eggshell can also sometimes be unusual, with dark brown or even blue 
colours. This wealth of diversity, together with the availability of genetic markers 
and the genome sequence, has allowed the identification of a number of mutations 
affecting qualitative traits whose genetic determinism is 
simple, usually monogenic. 

The polydactyly phenotype is an autosomal dominant trait, characterised by the 
presence of one or two extra toes on one or both feet of the chicken. Similar to what was 
found in human, mouse and cat, a mutation in a regulatory region in intron 5 of 
the Lmbr1 gene on chromosome 2 was shown to cause ectopic expression in the 
developing limb of Shh, located almost 0.5 Mb away, which encodes a major ligand in the 
Hedgehog signalling pathway with a key role in organogenesis (Dunn et al. 
2011). 

The pea-comb phenotype is a reduction in the size of the comb and the wattles. 
The genetic basis of this trait is a copy number variation in intron 1 of the 
Sox5 gene on chromosome 1, which codes for a transcription factor playing a role in 
chondrogenesis; the variation causes a transient ectopic expression of this gene. This 
duplicated sequence is not conserved across species, suggesting it does not have 
an important function by itself, but may disturb the action of regulatory elements in 
the region (Wright et al. 2009). A variation between 20- and 40-fold in the copy 
number, instead of 2 copies in wild-type animals, could possibly explain the 
observed variable expression of the pea-comb phenotype. Interestingly, the Sonic 
Hedgehog signalling pathway is also involved in the expression of this phenotype 
(Boije et al. 2012). 

Rose comb is another phenotype involving comb morphology and is 
usually associated with sperm motility defects. Interestingly, the rose-comb and 
pea-comb genes show epistatic interactions, with the individuals carrying both 
mutant alleles having the walnut-comb phenotype. The rose-comb phenotype was 
explained in most individuals by a 7.4 Mb inversion on chromosome 7 and in some 
other individuals by a recombination between a chromosome with the 7 Mb inversion 
and a wild-type chromosome. For the 7.4 Mb inversion, the proximal 
breakpoint is in the 5' UTR of FKBP7, 72 bp upstream of the start codon, this 
position being also 42 bp from the 5' UTR and 150 bp from the start codon of 
PLEKHA3. The distal breakpoint is in intron 3 of CCDC108, whose product is part of the sperm 
flagellum, which could explain the sperm motility defects. It was further shown that 
it is the transient ectopic expression in mesenchymal cells of MNR2, located 3 kb 
inside the distal breakpoint, which is responsible for the rose-comb phenotype. This 
transient expression of MNR2 could be due to the change of genomic context for the 
gene, following the inversion. Finally, transient ectopic expression of both SOX5 and 
MNR2 in the same population of mesenchymal cells could explain the walnut 
phenotype (Imsland et al. 2012). 

A third major comb phenotype, duplex comb, is also explained by the ectopic 
expression of a transcription factor, eomesodermin, also known as T-box brain 
protein 2 (Tbr2), but this time in the ectoderm. The mutation is a 20 kb tandem 
duplication containing several conserved putative regulatory elements located 
200 kb upstream of the eomesodermin gene (Dorshorst et al. 2015). 

Another cranial phenotype is Crest, in which the chicken has a tuft of elongated 
feathers on the top of its head. In this case, the exact mutation is not yet known, but it 
was mapped very close to the Hoxc8 gene, showing ectopic expression in the cranial 
skin during development, and is therefore very probably also a regulatory mutation 
(Wang et al. 2012). 

The naked neck phenotype, characterised by a lack of feathers on the head and 
neck, is associated with a large insertion of sequence from chromosome 1, containing 
the Wnt11 and Uvrag genes, approximately 260 kb downstream of the GDF7/BMP12 
genes on chromosome 3. This insertion is linked with an increased expression level 
of GDF7/BMP12, possibly by a long-range action of Wnt11 enhancers (Mou et al. 
2011). 

The case of colouration genes is complex, and classical genetics has shown that 
the numerous colouration phenotypes observed in chicken are due to complex 
epistatic interactions. Chickens can be, for instance, white, silver, lavender, buff, 
red, various shades of brown or black. The distribution of the colours on the 
body can also vary, with specific colours on the breast, the head, the tail, the legs, etc. 
Colours can also be distributed in speckles or even as stripes on the feathers, such as 
in the barred phenotype. A few mutations have been identified, explaining part of 
this diversity, some of which point to mechanisms conserved across vertebrates 
(Gross et al. 2009). Mutations in the MC1R (Melanocortin 1-receptor) gene at the 
extended black locus E alter the relative amounts of eumelanin (black-brown 
pigment) and phaeomelanin (yellow-red pigment) in melanocytes. Some missense 
mutations in the gene are associated with a constitutively active receptor causing a 
dominant black phenotype, while another is a loss-of-function mutation causing lighter colours 
(Kerje et al. 2003). Mutations in the SLC45A2 gene whose product is important for 
vesicle sorting in melanocytes cause the silver colour (missense mutations) or 
sex-linked imperfect albinism (a frameshift and a premature stop codon) 
(Gunnarsson et al. 2007). Phenotypic variation controlled by the dominant white 
locus I is caused by mutations in the PMEL17 gene, coding for a melanocyte-specific 
protein having a role in the development of eumelanosomes, pigment granules 
containing eumelanin in melanocytes. A 9 bp insertion in exon 10 adds three 
amino acids in the transmembrane region and causes the dominant white phenotype. 
A five-amino acid deletion in the transmembrane region causes the Dun (whitish) 
phenotype. Finally, a 9 bp insertion in exon 10 associated with a 12 bp deletion in 
exon 6 causes the Smoky (greyish) phenotype (Kerje et al. 2004). A retroviral 
insertion in intron 4 of the tyrosinase gene, essential for the synthesis of both 
eumelanin and phaeomelanin, is associated with the recessive white phenotype at 
the C locus (Chang et al. 2006), while a six-amino acid deletion in the gene causes 
autosomal albinism (Tobita-Teramoto et al. 2000). The lavender phenotype is due to 
a dilution of both the eumelanin and phaeomelanin pigments caused by a missense 
mutation in the MLPH (melanophilin) gene (Vaez et al. 2008). The dark brown 
phenotype at the DB locus is associated with an 8 kb deletion upstream of the SOX10 
gene, which has a role in melanocyte biology. A variation in the level of SOX10 
expression could have an effect in turn on the expression of key enzymes of pigment 
synthesis, such as tyrosinase (Gunnarsson et al. 2011). Finally, the sex-linked 
barring phenotype, characterised by striped feathers, has several allele morphs of 
varying dilution. These are due to different non-coding and coding changes in the 
ARF transcript of the CDKN2A tumour suppressor locus (Schwochow Thalmann 
et al. 2017). Interestingly, for most of these loci, the same genes are involved in 
similar phenotypes in other vertebrate species. For instance, MC1R is also involved 
in black colouration in wolves (Anderson et al. 2009) or in fishes (Gross et al. 2009). 
Colouration genes can act on other parts of the body. The blue egg phenotype is due 
to an endogenous avian retroviral insertion (EAV-HP) close to the SLCO1B3 gene, 
responsible for its overexpression in the oviduct, probably a cause of the accumulation 
of biliverdin in the eggshell gland; interestingly, two independent but similar 
mutations are responsible for the phenotype in Chinese and South American chicken 
(Wang et al. 2013; Wragg et al. 2013). The yellow skin is due to a cis-acting 
mutation inhibiting the expression of BCDO2 (beta-carotene dioxygenase 2) in the 
skin (Eriksson et al. 2008). 

Other phenotypes for which molecular mechanisms have been identified include silky 
feathers, due to a cis-regulatory mutation of PDSS2 (Feng et al. 2014), and frizzle 
feathers, due to an alpha-keratin mutation (Ng et al. 2012). 

As seen in these cases of simple monogenic phenotypic traits having a high 
penetrance, the identification of the causal variant is possible, especially if it involves 
a change in the coding sequence. In the case of regulatory mutations, this can be 
much more challenging and is mostly possible if they are due to large events that are 
easy to detect such as insertions, deletions or inversions or if there is a deep 
understanding of the biology underlying the trait, stemming from basic research. 


3.3 The Complex Case of Quantitative Trait Loci (QTLs) 

Identifying the molecular mechanisms underlying QTLs is much more challenging 
than for the monogenic traits. The initial QTL mapping strategies based on F2 or 
backcross designs crossing chicken lines differing for the traits of interest allowed 
the mapping of hundreds of QTLs (Abasht et al. 2006), and in a few instances, 
causative genes could be identified, such as the BCMO1 gene as a player in meat 
colour (Le Bihan-Duval et al. 2011). A review on the mapping of QTLs related to 
growth, egg quality, behaviour, metabolism and egg resistance (Abasht et al. 2006) 
gives an idea on the diversity of traits studied and also shows the co-localisation of 
QTL from independent studies and therefore their confirmation. More data on QTLs 
can be found at the ChickenQTLdb website: 
http://www.animalgenome.org/cgi-bin/QTLdb/GG/index. However, the mapping intervals that could be defined with such 
strategies were very large, usually several Mb containing dozens of genes, due to the 
limited number of recombinants observable in such designs. For instance, in a 
backcross design, the number of recombinants will be equal to the number of backcross offspring multiplied 
by the number of recombination events per chromosome, which ranges in chicken between 0.5 for 
microchromosomes and 5 for macrochromosomes (Groenen et al. 2009). To refine 
mapping precision, advanced intercross lines (AIL) can be made by intercrossing over 
successive generations. For instance, an AIL between a layer and a broiler line 
maintained over 16 generations allowed a growth QTL to be mapped with high precision on 
chromosome 4, involving a variation of expression of the satiety signal receptor gene 
CCKAR (Dunn et al. 2013). However, in most cases, this approach using specific 
crosses and divergent populations did not allow the genes underlying QTLs to be identified. 
Thanks to the availability of high-density SNP chips, more recent studies can 
now be based on GWAS, allowing for higher precision in mapping. The additional 
benefit is that the QTLs are detected directly in the populations of interest to breeders 
(Rome et al. 2015), whereas one could not be certain that QTLs found in experimental 
crosses would also segregate in the populations of interest. 
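
The backcross arithmetic given in this paragraph can be written out as a short Python sketch. The rates of 0.5 and 5 recombination events are those quoted in the text; the function name and the number of offspring are hypothetical and chosen only for illustration.

```python
def expected_recombinants(n_offspring, events_per_chromosome):
    """Expected number of recombination events observed on one chromosome
    across all backcross offspring."""
    return n_offspring * events_per_chromosome


n_offspring = 200  # hypothetical size of the backcross population

for label, rate in (("microchromosome", 0.5), ("macrochromosome", 5)):
    print(label, expected_recombinants(n_offspring, rate))
```

Because the count grows only linearly with the number of offspring, a single generation of crossing yields few informative recombinants, which is the limitation that advanced intercross lines and population-based GWAS are meant to overcome.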

The challenge behind elucidating molecular genetic mechanisms underlying 
QTLs is also expected to be much greater than for monogenic traits because of their 
more complex determinism, which may involve a polygenic nature, strong interactions with the 
environment or complex regulatory mechanisms involving non-coding sequences, 
such as promoters, enhancers, regulatory RNA, etc. Indeed, while it is often possible 
to correlate the expression level of genes with that of traits in transcriptomic studies, 
the causes of these variations at the genome level, which are of interest to the breeder 
for selection of traits, remain elusive. The GWAS approach will help refine the 
location of causative mutations in the case of quantitative traits, but as seen for 
complex traits in human, causative variants may be difficult to identify, and further 
information on genome biology such as the definition of regulatory regions will 
often be required. A typical example is the identification of a SNP having an allele 
causing obesity in human. This identification was greatly facilitated by the existence 
of a series of whole genome datasets on different epigenetic marks in numerous 
tissues (Roadmap Epigenomics Consortium et al. 2015). Mining this dataset made it possible 
to restrict the region suggested by the GWAS to part of an intron of the FTO gene 
having characteristic enhancer properties only in mesenchymal adipocyte 
progenitors. This last information allowed for functional tests to be performed 
using the correct tissue, demonstrating the effect of the FTO mutation on both the 
IRX3 and IRX5 genes located downstream (Claussnitzer et al. 2015). 


4 Exploiting Breeding Populations and Model Organisms 
for Bird Biology 

4.1 Domestication Syndrome 

The domestication syndrome refers to a combination of traits that are often found 
together in domesticated animals that are not found in their wild ancestors. These 
include increased tameness, variations of coat colour, reduction of craniofacial 
dimension, nonseasonal oestrus cycles, prolongation of juvenile behaviour and 
reduction in brain size (Wilkins et al. 2014). Such traits can be observed in domestic 
chicken, and progress on understanding the domestication syndrome in animals will 
benefit from comparisons with the wild Gallus gallus ancestor. See, for instance, the 
link between tameness and brain size in Agnvall et al. (2017). A recent hypothesis 
suggests the domestication syndrome traits are related to altered neural crest cell 
development (Wilkins et al. 2014) and chicken is again an interesting model thanks 
to its importance in embryology (Theveneau and Mayor 2012). In domestic animals, 
it is however not always easy to separate selection due to the early stages of 
domestication, from subsequent selection on production traits (Loog et al. 2017). 
Candidate domestication genes were identified through the comparison of modern 
chicken breeds and the red jungle fowl Gallus gallus, the major wild relative. The 
two major genes currently discussed are the thyroid-stimulating hormone receptor 
TSHR and the beta-carotene dioxygenase 2 BCDO2 (Rubin et al. 2010). TSHR plays 
a role in the photoperiodic control of reproduction, and selection could be linked to 
the fact that domesticated animals tend to lose any strict seasonal reproduction. In the 
case of layer chicken, selection has in fact gone very far in that respect, towards the 
production of a very high number of eggs all year round. The allele present at high 
frequency in domestic chicken has since been directly associated with photoperiodic 
response, reproduction, reduced aggressive behaviours and decreased fear of 
humans (Karlsson et al. 2015, 2016). BCDO2 cleaves carotenoids in the skin and,
when inactive, causes yellow skin colouration due to their accumulation. The
yellow skin allele is thought to come from the grey jungle fowl Gallus sonneratii, a
close relative of the red jungle fowl Gallus gallus, suggesting a hybrid origin of
domestic chicken (Eriksson et al. 2008). It was first suggested that the selection on
these two genes took place at the very beginning of chicken domestication 
6000 years ago, but new evidence produced by sequencing ancient DNA from 
chicken remains in archaeological sites points towards more recent events (Flink 
et al. 2014; Loog et al. 2017). The selection on the TSHR gene is now thought to 
have taken place during the Middle Ages, at a time when the proportion of chicken 
remains in archaeological data was increasing presumably due to changes in diet 
preferences in humans causing higher flock densities and increased egg production. 
The introgression of the BCDO2 yellow skin allele can be attributed to recent gene
flow from Asia 100-150 years ago, at the time of the creation of the modern chicken
breeds, and its prevalence varies among breeds.


4.2 Epigenetic Mechanisms Specific to Birds: Absence
of Imprinting and Z Dosage Compensation

Epigenetic mechanisms affecting the expression of genes according to developmental
stage, tissue type or the environment are increasingly studied. Having a bird
model organism is essential to study these mechanisms from an evolutionary
standpoint and to deepen our understanding of mechanisms specific to birds. There
are two main examples in which bird genomes have epigenetic mechanisms that are
quite different from those of mammals. The first is dosage compensation of the
gonosomes, which in mammals allows for similar expression levels of chromosome X
genes in female (XX) and male (XY) cells. The second is imprinting, whereby
only the paternal or maternal copy of a gene is expressed.

In birds, in contrast to mammals, the heterogametic sex is the female, having Z
and W gonosomes. As gene expression levels usually correlate with gene dose, the
expression level of dosage-sensitive genes must be corrected to have equivalent
values in the cells of homogametic and heterogametic individuals. Several
mechanisms are possible for this. In mammalian females, for instance, one of
the X chromosomes is inactivated, and except for a few genes escaping inactivation,
gene expression is only from the active chromosome (Augui et al. 2011). Analyses
of male/female ratios of chicken and zebra finch chromosome Z gene expression in
several tissues showed that dosage compensation in chicken is incomplete: expression
levels are on average higher in males than in females, although not twofold higher as
would be expected with no compensation at all (Itoh et al. 2007). The dosage compensation effect
varies along the chromosome, with a cluster of completely compensated genes close 
to the male hypermethylated region on chromosome Zp (Melamed and Arnold 
2007). Recent analysis of allele-specific expression showed that for the majority of 
genes expressed in males, both alleles were expressed at similar levels, suggesting 
that the overall reduction of expression is not due to the inactivation of one of the two 
chromosomes, but to a different mechanism (Wang et al. 2017). 
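
The male-to-female comparison underlying such analyses can be sketched in a few lines of Python. This is only a minimal illustration and not the procedure used by Itoh et al. (2007): the function name, the pseudocount and the expression values are hypothetical. With no compensation, Z-linked genes in ZZ males would be expected to show twice the expression seen in ZW females, that is a log2 male/female ratio of 1; with complete compensation the ratio would be 0, and incomplete compensation falls in between.

```python
import math
from statistics import median


def log2_mf_ratios(male_expr, female_expr, pseudocount=1.0):
    """Return per-gene log2(male/female) expression ratios.

    male_expr, female_expr: dicts mapping gene name -> mean expression.
    pseudocount: assumed small constant that avoids division by zero.
    Only genes measured in both sexes are compared.
    """
    shared = sorted(set(male_expr) & set(female_expr))
    return {
        gene: math.log2(
            (male_expr[gene] + pseudocount) / (female_expr[gene] + pseudocount)
        )
        for gene in shared
    }


# Made-up expression values for three hypothetical Z-linked genes.
z_male = {"geneA": 150.0, "geneB": 90.0, "geneC": 61.0}
z_female = {"geneA": 100.0, "geneB": 60.0, "geneC": 60.0}

ratios = log2_mf_ratios(z_male, z_female)
z_median = median(ratios.values())

# 0 = complete compensation, 1 = no compensation (twofold male excess).
print(f"median log2(M/F) for Z-linked genes: {z_median:.2f}")
```

In practice the same ratio would also be computed for autosomal genes, which serve as the baseline against which the Z chromosome is compared.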

Genomic imprinting is the imbalance in parental gene expression observed for 
some genes, in which only the alleles of maternal or paternal origin are expressed. 
The most widely accepted theory explaining imprinting is that of the parental conflict, in
which the genome of paternal origin expressed in an embryo will favour its growth at 
the expense of the mother, whereas the genome of maternal origin will tend to 
restrict the use of resources, allowing for the mother to remain fit for producing more 
offspring in the future (Haig 2014). According to this theory, it is usually accepted 
that imprinting should be restricted to organisms in which maternal resources 
directly affect the embryo and to genes whose functions are related to its growth. 
The development of the bird embryo takes place in an egg, and therefore no 
imprinting should be found. However, it was also shown that a number of imprinted
genes are expressed in the brain and can influence adult phenotypes related to
the level of investment in reproduction, for instance, maternal care. This will be
particularly true in cases where a mother’s genes of paternal or maternal origin may
be unequally related to other individuals in a group (Haig 2014). Whole-genome
transcriptomic analyses were performed in chicken embryo (Fresard et al. 2014) and
brain (Wang et al. 2015), with the intent of detecting possible parent-of-origin
allele-specific expression, but no sign of imprinted genes could be found.


4.3 Speciation 

The question of the mechanisms of speciation can be addressed with
genomics (see also Ottenburghs 2019). Only a limited number of speciation genes
have been discovered to date (Mack and Nachman 2017), and PRDM9 is the only 
one that was found in vertebrates (Mihola et al. 2009). Positive selection on PRDM9 
is responsible for the rapid relocation of recombination hotspots across populations 
and species (Stevison et al. 2016), thus provoking an arrest of meiotic prophase 
(Davies et al. 2016). However, PRDM9 is not present in all vertebrates and more 
specifically is absent from all birds investigated so far (Baker et al. 2017). In 
accordance with this observation, recombination hotspots are stable in birds, at
least in the two finch species investigated (Singhal et al. 2015). For such a study, a
high-quality reference genome was required to avoid the appearance of false
recombination hotspots due to inversions in the assembly.


5 Deep Understanding of the Genome 

Several approaches are currently being used to deepen our knowledge of the biology
of genomes, including those of birds. The two main strategies to identify functional
elements are based either on comparative analysis of multiple genomes
(Lindblad-Toh et al. 2011) or on biochemical assays to study defined regions of the genome
exhibiting a particular biochemical signature (ENCODE Project Consortium et al.
2012). With the emergence of novel molecular methods for genome editing (Zhang
et al. 2014a), it is now possible to test the functional significance of the identified
elements, whether protein-coding or regulatory, which is crucial if the aim
is to establish the link between certain phenotypes and the underlying genotypes.

Although comparative genomics has been extensively used to define the
protein-coding gene repertoire of genomes, for example, by mapping protein-coding genes
to the assembly (Jarvis et al. 2014; Yates et al. 2016; NCBI Resource Coordinators
2017), full genome comparisons of multiple species also provide a way to understand the
fraction of a genome which evolves nonneutrally, either due to purifying or to
positive selection (Lindblad-Toh et al. 2011). These selective signatures often
correctly identify novel functional regions in genomes and so can be used to
distinguish causative from linked genetic variants that exist in populations.
Evolution of protein-coding genes also leaves detectable traces in full genome multiple
sequence alignments, thus providing a way to predict so far unannotated
protein-coding genes (Lin et al. 2011). Nevertheless, apart from the above example,
signatures of selection rarely indicate the exact biological function of the genomic
segment, and so function should be experimentally tested.

Next-generation sequencing technologies, both short-read (e.g. Illumina) and
long-read (e.g. Pacific Biosciences), are now readily available and can be coupled
with various biochemical assays (1) to characterise the complexity of the transcriptome
and its protein-coding and regulatory components (for a transcriptomic review targeted at
non-specialist ornithologist readers, see Jax et al. 2018); (2) to define promoters,
enhancers and other regulatory elements; (3) to characterise chromatin state and
DNA-DNA interactions; and (4) to identify epigenetic modifications. Similar to the
efforts made by the ENCODE project (ENCODE Project Consortium et al. 2012),
such approaches are currently being undertaken within the Functional Annotation of
Animal Genomes project (FAANG) (Andersson et al. 2015) to deepen our
knowledge of the chicken genome biology.


5.1 Phylogenomics Analyses 

The first genome-wide comparative analysis of a bird was performed when the chicken
genome was sequenced and compared to that of the human. As the last common
ancestor of mammals and birds lived approximately 310 million years ago, the 
nucleotide sequence has changed to such an extent that only 2.5% of the chicken 
genome could be aligned between the two species. Most of the conserved sequences 
were attributed to protein-coding regions, covering 75% of all coding exons, or 
promoter regions of genes, while the rest were shown to be clustered together in
gene-poor regions and were linked to known or predicted conserved regulatory elements
(Hillier et al. 2004). Indeed, functional elements are frequently shared between
species and are under selective constraint due to the effect of negative selection,
which removes selectively disadvantageous alleles from populations. The proportion of
the genome under negative selection and the distribution of constrained sites
therefore have functional consequences and are important parameters of a genome
(Lander et al. 2001; Eory et al. 2010; Lindblad-Toh et al. 2011). 

The resolution of constrained sites within a genome primarily depends on the 
total divergence rate across the phylogeny of species compared. The power to detect
signatures of selection increases with the number of genomes sequenced and with how
broadly the given phylogeny was sampled. With optimal sampling it is
possible to increase the total divergence rates to such an extent that the effect of 
selection can be estimated at single nucleotide resolution (Lindblad-Toh et al. 2011). 
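
The dependence of resolution on total divergence can be illustrated with a simple calculation. As a hedged sketch, assume that substitutions at a neutral site follow a Poisson process, so that the probability of observing no substitution across a phylogeny with total branch length T (in substitutions per site) is exp(-T). The smaller this probability, the less likely it is that a perfectly conserved single site is merely neutral. The function names and the branch lengths below are illustrative assumptions and are not taken from the studies cited.

```python
import math


def prob_neutral_site_unchanged(total_branch_length):
    """P(no substitution at a neutral site) under an assumed Poisson model."""
    if total_branch_length < 0:
        raise ValueError("total branch length must be non-negative")
    return math.exp(-total_branch_length)


def min_branch_length(alpha):
    """Smallest total branch length at which a fully conserved single site
    is unexpected under neutrality at significance level alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -math.log(alpha)


# Illustrative total branch lengths (substitutions per site).
for t in (0.5, 1.0, 3.0, 5.0):
    p = prob_neutral_site_unchanged(t)
    print(f"T = {t:.1f} -> P(unchanged | neutral) = {p:.3f}")

print(f"T needed for alpha = 0.05: {min_branch_length(0.05):.2f}")
```

Under this simple model, about three substitutions per site in total are needed before a single invariant site becomes significant at the 5% level, which is why adding more, and more divergent, genomes sharpens the resolution.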

A map of selectively constrained sites in the human genome, based on a 29-way 
mammalian genome comparison, showed that over 5% of the human genome is 
under purifying selection and the study helped to detect candidate protein-coding 
exons, stop codon read-through events, novel RNA structural variants and 
constrained regulatory elements (Lindblad-Toh et al. 2011). 

To better understand avian genome evolution via comparative genomics, multiple
bird genome assemblies, broadly sampled across the avian tree, were needed.
These were provided by the collaborative effort of the Avian
Phylogenetic Consortium (Jarvis et al. 2014), which sequenced and released 45 bird
genomes and refined the phylogeny of birds.
The work of the Consortium shed light on many features of bird genome
evolution.

One of the many questions was why birds have relatively small genome sizes
of ~1 Gb, compared to reptiles and mammals, whose genome sizes vary between 1 and 8 Gb
(Zhang et al. 2014c). It was found that the main factors maintaining the smaller
genome sizes of birds are, first, the lower content of repeat elements (4-10%) relative
to mammals (50%); second, the twofold higher rate of small deletion events relative to
reptiles; and last, the high number (118) of lineage-specific large segmental deletion
events relative to reptiles (Zhang et al. 2014c), which also led to the loss of 7% of the
macrochromosomal genes that are present in the green anole.

Despite the overall loss in number of genes, the number of gene gains and losses 
is relatively small in birds, and there are indications that gene turnover is lower in 
birds than in mammals (Zhang et al. 2014c). 

Many bird-specific morphological adaptations were traced back to changes at
the molecular level, including adaptive evolution in genes or in regulatory regions.
While lower divergence rates frequently mark regions under negative
selection, an accelerated rate of evolution can indicate adaptive changes in the genome. For
birds, adaptive changes were found in genes that are crucial for the capacity
for flight. Some of these genes or regulatory elements take part in bone development
(Zhang et al. 2014c), others in the development and diversification of feathers (Zhang
et al. 2014c; Lowe et al. 2015) or in increased metabolism (Zhang et al. 2014c).

As the sampling of birds included species representing the main lineages, it was 
possible to make comparisons, for example, of gene evolution between vocal-learner
(hummingbirds, songbirds and parrots) and nonvocal-learner birds. Two hundred
and thirty-seven candidate genes were identified showing signs of adaptive evolution 
across vocal-learners, many of which are expressed in the brain of vocal-learner 
birds (Zhang et al. 2014c). 

When considering purifying selection, the 45 additional bird genomes greatly 
increased the resolution to detect selectively constrained regions in birds, and it was 
estimated that 7.5% of the avian genome was highly conserved (Zhang et al. 2014c). 
Nevertheless, when genome sizes are factored in, the per Mbp constraint contents are 
rather different between mammals and birds. While the human genome is predicted 
to contain around 160 constrained sites per 1 Mbp (Lindblad-Toh et al. 2011), the 
prediction for birds is only 75 nt (Zhang et al. 2014c). The more than twofold 
difference in constrained sites between birds and mammals raises interesting 
questions about what factors are influencing the constraint content of a genome. It 
is possible that the difference is caused by differences in the way constraint was
calculated (i.e. purely technical), but it may also reflect a weaker selection
regime on bird genomes in general, or birds may actually maintain their physiology
with a smaller number of constrained functional elements or have more unique
lineage-specific elements. Which of the above explains the difference remains
to be seen.

Ongoing negative selection on the protein-coding regions of genomes leaves 
phylogenetic signatures which are detectable in multiple sequence alignments. 
Mutations in protein-coding codons which would change the optimal amino acid
into an unfavourable one will be eliminated by purifying selection. The effect of
selection is less severe on mutations resulting in amino acids with similar
physicochemical properties to the original amino acid. Also, substitutions at the third
nucleotide of fourfold degenerate codons (codons where the third nucleotide can freely
change without changing the amino acid) evolve nearly neutrally and are not under strong
negative selection (Ohta 1995; Lin et al. 2011). These unique selection signatures
can be detected in multiple sequence alignments of protein-coding regions of a
genome, and the phylogenetic coding potential can be estimated at codon-by-codon
resolution by PhyloCSF (Lin et al. 2011). This is illustrated in Fig. 3 for
one of the exons of the TMTC2 gene. While the protein-coding exon has a limited
number of substitutions, which mainly result in synonymous or conservative amino
acid changes, the introns contain many more substitutions which are inconsistent
with protein-coding sequence evolution. The splice acceptor and splice donor sites
are also conserved. The phylogenetic coding potential can correctly identify a
protein-coding exon and its boundaries along with the strand and the reading frame of the
exon. Due to the large number of bird assemblies in the alignment, it is now possible
to identify novel exons of genes or cases when the annotated ORF of a transcript may
not cover the real ORF (see Fig. 4). Using the chicken genome as reference in an
alignment of 49 sauropsids, the genome-wide coding potential predicts that there are
23% more protein-coding nucleotides in the chicken and other bird genomes relative
to what had been annotated in the Galgal4 assembly. This estimate suggests that the
number of protein-coding genes may be much larger than the ~15,000 predicted
before (Zhang et al. 2014c; Yates et al. 2016), and this was indeed found using
transcriptome datasets for chicken (Kuo et al. 2017).
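
The codon-level signatures described above can be made concrete with a short Python sketch. It is not PhyloCSF, which scores whole alignments with phylogenetic codon models; it only classifies a single codon substitution as synonymous or non-synonymous and tests whether a codon is fourfold degenerate, using the standard genetic code. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon):
    """Translate one DNA codon ('*' denotes a stop codon)."""
    codon = codon.upper()
    if codon not in CODON_TABLE:
        raise ValueError(f"not a valid DNA codon: {codon!r}")
    return CODON_TABLE[codon]


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid."""
    return translate(codon_a) == translate(codon_b)


def is_fourfold_degenerate(codon):
    """True if any base at the third position gives the same amino acid."""
    translate(codon)  # validates the input
    prefix = codon.upper()[:2]
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


# Alanine codon: third-position changes are silent.
print(is_synonymous("GCT", "GCC"), is_fourfold_degenerate("GCT"))
# Aspartate to glutamate: a conservative but non-synonymous change.
print(is_synonymous("GAT", "GAA"), is_fourfold_degenerate("GAT"))
```

In a coding exon, most observed substitutions are expected to fall into the synonymous class, whereas in introns and other non-coding sequence no such pattern is expected; this contrast is what the phylogenetic coding potential exploits.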

Although protein-coding sequences make up a significant proportion of 
constrained sites, greater than 80% of highly constrained elements were predicted 
to be non-protein-coding (Zhang et al. 2014c), and although predicting functional 
consequence may be possible (Lindblad-Toh et al. 2011), establishing the function 
would require biochemical evidence. 

Besides the fact that selective constraint is largely uninformative about function, another
technical limitation of such studies is that multiple sequence alignments are
mainly reference-species based. For example, the MULTIZ multiple sequence
alignments are built from pairwise alignments for all possible pairs between the
reference and target species. As a consequence, such alignments, while fully
representing the reference genome, miss out on genomic regions shared within a
certain clade, but not present in the reference genome. The full representation of a
species in an alignment therefore requires that the species in question becomes in
turn the reference, a situation largely benefiting model organisms (Blanchette et al.
2004). For example, vertebrate alignments in Ensembl use the human genome as
reference (Yates et al. 2016), but as a result homologous regions which are
unique to murids are not present in the alignment, making it impossible to study
lineage-specific genome evolution. The emergence of novel algorithms for creating
(Paten et al. 2011) and storing (Hickey et al. 2013) reference-free multiple sequence 
alignments provides a better conceptual and technical framework for comparative 
genomics and enables the storage and analysis of arbitrary subclades in phylogenies
with the reconstructed ancestral sequences without the loss of species- or
clade-specific data. It is now becoming possible to store, share and visualise large
comparative genomic datasets and to map annotation across all species within a
given phylogeny, therefore representing each species with its full assembly and
annotation individually (Nguyen et al. 2014).

Fig. 3 Phylogenetic coding potential in multiple sequence alignments of birds. (a) The transcript
model of the single-transcript TMTC2 gene on the forward strand is represented in UCSC notation.
Boxes represent exons, with taller boxes representing parts of the open reading frame. Joining lines
represent introns, which are spliced out from the final transcript product before the sequence gets
translated into protein. (b) Zoomed-in region of exon 8 with flanking intronic regions. (c) Codon-by-codon
PhyloCSF signal on the positive strand for the three possible reading frames. Positive scores
along the multiple sequence alignment (d) indicate potential protein-coding region in the genome
on the positive strand in 0 reading frame. (d) The underlying 47-way bird alignment shows the
region of interest as represented by CodAlignView. The region of exon 8 shows remarkable
protein-coding conservation. Substitutions are rare within this region, and the large majority of these lead
either to synonymous or conservative amino acid changes. Splice acceptor and donor sites flanking
the exon are also fully conserved. Unlike the coding exon, the flanking intronic regions have
accumulated multiple substitutions and deletions without any evidence that these changes would
preserve an amino acid coding function. Indeed, many of these changes would result in stop codons
or frameshifts if the sequence were protein-coding.

Fig. 4 Novel exon and coding sequence extension of the BAZ1B gene using phylogenetic coding
potential. The positive PhyloCSF signal predicts multiple exons in the BAZ1B genic region on
chr19 in the galGal4 assembly. While many of these overlap with exons of the Ensembl BAZ1B
transcript model, there are 21 novel PhyloCSF+ regions which indicate novel protein-coding
sequences. Three of these are located within the coding region identified by Ensembl and represent
novel protein-coding exons. The rest overlap with exons which are annotated by Ensembl as
non-protein-coding exons, so-called untranslated regions. An alternative transcript model from PacBio
long-read and RNA-seq-based models verifies the three “within coding” exons, while PhyloCSF
prediction on the full-length alternative transcript model suggests a much longer coding sequence
within the transcript with alternative translation start and end sites.


5.2 Improving Annotation and Gene Organisation Using 
Long-Read Transcriptome Sequencing 

Understanding the protein-coding, structural and regulatory components of the
transcriptome coded in the genome, along with transcript isoforms and alternative
splicing, is one of the first steps in understanding the relationship between the
genotype and phenotype of individuals within populations. While some model
organisms benefited from large cDNA libraries, effectively providing full-length
transcript sequences, others had their genome annotated either using comparative
methods or from relatively cheap, high-throughput, short-read sequencing data. The
latter became prevalent as a cost-effective way to provide species-specific datasets
for multiple tissues and developmental stages, with a twofold benefit (Wang et al.
2009). First, it is possible to reconstruct the transcriptome from RNA-seq datasets.
Second, it enables the quantification of gene expression and as such helps to detect
physiological changes between different conditions or protein-protein or other
regulatory interactions through the study of differential expression or the analysis of
gene co-expression networks. Despite its benefits, understanding the transcriptome
from short-read data is error prone. This is in part a consequence of the technology, 
which relies on relatively short, 50-150 nt long sequenced fragments of the full 
transcripts, but also because the sequencing shows compositional bias and the read 
coverage is unequal both within and between transcripts (Wang et al. 2009). Due to 
these problems, mapping of the reads is ambiguous (Engstrom et al. 2013), and 
reconstruction of the transcript isoforms is only probabilistic, software dependent 
and error prone (Steijger et al. 2013), while it is impossible to recover the exact 
transcription start and end sites. It is especially difficult to sequence genes belonging 
to large gene families or residing in repeat regions, as unique mapping of reads 
becomes almost impossible within such regions. As a consequence, RNA-seq-based
transcriptomes may comprise fewer genes than expected, with a lower number of
isoforms (Kuo et al. 2017). The emergence of long-read sequencing technologies,
especially from Pacific Biosciences (Roberts et al. 2013) and Oxford Nanopore
Technologies (Laver et al. 2015), provides solutions to the above problems by
enabling full-length transcript sequencing as a single read. This makes the
identification of correct isoform structure unambiguous, and coupling with 5' cap and polyA
selection enables reliable identification of transcription start and end sites. Also,
when normalisation is applied during the RNA library preparation, sequencing of 
low-abundance transcripts becomes possible leading to an enriched transcriptome 
annotation (Kuo et al. 2017). 

As short- and long-read datasets become available, it is now possible to better 
understand the transcriptome complexity of bird genomes (Jax et al. 2018). For 
example, the number of annotated protein-coding and non-coding Ensembl genes 
and transcripts has been increased for the chicken genome (Table 2; Yates et al. 
2016), and recent estimates suggest that the chicken genome has a similar number of
genes and transcripts to mammalian genomes (Kuo et al. 2017).

Long-read and deep RNA-seq datasets along with improved genome assemblies 
will certainly help resolve ambiguities in gene sets of birds and will clarify the 
evolutionary history of gene families. For example, a recent analysis of bird 
genomes suggested that 274 protein-coding genes, which are present in conserved 
syntenic clusters in non-avian vertebrates and involved in important biological
processes, had been lost during avian genome evolution (Lovell et al. 2014).
Nevertheless, an analysis of de novo assembled RNA-seq transcriptomes found
evidence that 137 of the assumed missing genes exist in chicken and suggested
that these genes had not previously been identified because of the poor
coverage of the GC-rich genomic regions where they are located (Bomelov et al. 2017).


Table 2 Gallus gallus transcriptome annotation statistics

Number of genes                       Galgal2.1      Galgal4        Galgal5
                                      (Ensembl 67)   (Ensembl 85)   (Ensembl 89)
Protein-coding                        16,736         15,508         18,346
Small non-coding                      1102           1408           1705
Long non-coding                       0              0              4643
Misc non-coding                       0              150            144
Pseudogenes                           96             42             43
Total number of transcript isoforms   23,392         17,954         38,118


Long-read datasets may also be used to improve the annotation of closely related 
species, as full-length transcript sequences of one species can be mapped onto the 
genome of related species (Kuo et al. 2017), but with long-read sequencing getting 
more affordable, there is a real possibility that species-specific, rich transcriptome 
annotations will be made available for non-model species, effectively enabling each 
species to become “its own reference”. 


5.3 Inferring Function with "Assay-by-Sequence" Experiments 

Next-generation sequencing technology has transformed genomics research and can
contribute to high-quality, richly annotated reference genomes. While many
protein-coding genes are conserved between species, alternative splicing and the resulting
diverse transcriptomes are significantly different between species (Barbosa-Morais
et al. 2012) and therefore cannot be inferred from comparative data. Similarly, a
significant fraction of the non-coding RNA genes are highly diverse and not 
sufficiently conserved between species and so require species-specific transcriptome 
information (Ulitsky and Bartel 2013). Access to species-specific transcriptome data 
may therefore be necessary to link genotype with observed phenotypic differences. 

Although the gene annotation and information on the context-specific expression
of genes and transcripts are important in the above process, understanding the
regulation of these genes can also play a crucial role in linking the sequence to the
consequence. Assaying DNA-protein interactions (e.g. with chromatin
immunoprecipitation followed by sequencing, i.e. ChIP-seq) or the lack thereof
(Assay for Transposase-Accessible Chromatin, i.e. ATAC-seq) may provide the
missing functional information to understand the trait of interest. Identifying protein
and DNA interactions can detect active promoter regions and individual
transcription factor (TF) binding events in the genome and can also identify active
enhancers or repressors which modulate the level of expression of particular genes
or clusters of genes (Park 2009). The association of multiple ChIP marks at the
same genomic location may provide detailed information on the
regulatory/transcriptional status of genomic regions through the process called “segmentation”
(e.g. Ernst and Kellis 2012), for example, by classifying enhancers into active,
poised or weak enhancers. Accessible, protein-free DNA can also be assayed by,
e.g. ATAC-seq (Buenrostro et al. 2013), which in turn can help to locate transcription
factor binding sites (TFBS) and can define nucleosome positions.
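
The idea of combining several ChIP marks at one location into a regulatory state can be illustrated with a toy rule-based classifier. This is deliberately not the segmentation method of Ernst and Kellis (2012), which learns states from the data with a hidden Markov model. The rules below are a simplified and assumed convention (H3K4me1 marking candidate enhancers, H3K27ac marking active ones and H3K27me3 marking poised ones), and the function name and region names are hypothetical.

```python
def classify_enhancer(marks):
    """Classify a candidate enhancer from the set of histone marks present.

    The rules are an assumed, simplified convention for illustration only.
    """
    marks = set(marks)
    if "H3K4me1" not in marks:
        return "not an enhancer candidate"
    if "H3K27ac" in marks:
        return "active"
    if "H3K27me3" in marks:
        return "poised"
    return "weak"


# Hypothetical regions with the ChIP marks detected at each of them.
regions = {
    "region_1": {"H3K4me1", "H3K27ac"},
    "region_2": {"H3K4me1", "H3K27me3"},
    "region_3": {"H3K4me1"},
    "region_4": {"H3K27me3"},
}

states = {name: classify_enhancer(marks) for name, marks in regions.items()}
for name, state in states.items():
    print(name, state)
```

Real segmentation tools work on binned, genome-wide signal from many marks at once and infer the number and meaning of states from the data rather than from fixed rules such as these.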

Although TFs in general have highly conserved DNA-binding preferences, most 
of the binding events were found to be species specific in vertebrates (Schmidt et al. 
2010), while enhancer sequences, at least in mammals, appear to be more diverse, 
arise from DNA exaptation and undergo quick functional turnover (Villar et al.
2015). The fact that regulatory conservation is less prevalent between species
impedes the identification of these elements through orthology and suggests that
detailed and reliable regulatory maps will only be obtained by doing assays at
the individual species level as opposed to relying on the annotation of a reference
species. As large-scale biochemical assay-based functional annotation is still
expensive and resource intensive, detailed functional maps of birds are nevertheless likely
to first become available for reference species through the joint effort of research
communities similar to the efforts of the ENCODE (ENCODE Project Consortium
et al. 2012) and FAANG (Andersson et al. 2015) project consortia.


6 Do We Still Need a Reference Organism? 

6.1 How Complete Is the Existing Chicken Reference? 

Despite all the efforts dedicated to producing a complete chicken reference genome
and all the progress made in sequencing technologies, more work remains to be
done. Notably, six of the smallest microchromosomes are still missing in the most
recent GRCg6a assembly released in 2018. This problem has been a concern since 
the very beginning of chicken genomics in the early 1990s, and although a complete 
molecular karyotype was described (Masabanda et al. 2004), the lack of large insert- 
containing BAC clones for the smallest microchromosomes has always been a 
problem for assigning any other sort of information including maps and genome 
sequence to them. Specific strategies were dedicated to the sequencing of 
microchromosomes. One was to develop radiation hybrid (RH) linkage groups 
with markers developed from chicken mRNA and expressed sequence tag (EST) 
sequences that were absent in the genome assembly or from non-localised chicken 
genomic sequence having sequence similarity to synteny blocks in human obviously 
missing in the chicken assembly. This strategy allowed the addition of chromosomes 
25 (Douaud et al. 2008) and 33 (Morisson et al. 2007; Warren et al. 2017). Another 
approach used was to develop SNP markers from non-localised sequence in the 
assembly, to build new genetic linkage groups (Warren et al. 2017). However, for
both these approaches, chromosome assignment still requires large-insert BAC
clones for FISH mapping, which often proved impossible to find when screening
the existing libraries. Also, as the smallest microchromosomes are difficult to
identify on metaphase preparations, one solution is to study them in the form of
lampbrush chromosomes observed at the diplotene stage of meiosis (Galkina et al.
2017). Although the exact number is still under debate, a number of genes for which
transcriptome data are available are clearly absent from the Galgal5 genome
assembly, which has important implications for the description of the gene
content in chicken and by extension in other bird species (Bomelov et al. 2017).
Orthologs of many of these transcripts are grouped in the human genome, suggesting
that chromosomal segments are absent in blocks (Bomelov et al. 2017). Specific
regions of macrochromosomes can also cause problems, such as exemplified by the 
leptin gene, coding for a satiety hormone, a key regulator of energy balance in 
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mammals (Halaas et al. 1995). For many years, although the leptin receptor gene was 
unambiguously found in the genome, the leptin gene itself could not, and its 
existence was subject to debate for around 17 years (Pitel et al. 2010). The proof 
of its existence was only recently published, and radiation hybrid mapping localised 
it on chromosome 1 (Seroussi et al. 2017). This and many other examples show the 
interest of having a very high-quality genome assembly scrutinised by a large 
community of scientists. Work is still ongoing, and there is no doubt progress will 
be done, as chicken is one of the very rare species being member of the GRC. 


6.2 Other Birds: Specific Reference Genomes 

Other bird species have been or will be sequenced for various reasons: other poultry 
species such as turkey (Dalloul et al. 2010) or duck (Huang et al. 2013) due to their 
interest in breeding, the zebra finch because of its interest as a model organism for 
the biology of learned vocalisation (Warren et al. 2010) and more than 40 other 
species for bird phylogenomics and evolution studies (Zhang et al. 2014b). Thanks 
to lowering costs and improved technologies, the number of bird genomes 
sequenced will increase rapidly. Bird species bred for human nutrition purposes 
are mostly restricted to the gallo-anseriformes and are therefore close to chicken
from a phylogenetic point of view. Despite this, many have their genome sequenced 
at varying levels of resolution. Turkey (Dalloul et al. 2010) and quail (Morris et al. 
2019) have genomes assembled at the chromosome level thanks to using genetic 
(turkey and quail) and physical (turkey) maps. On the other hand, the duck genome
sequence cannot yet be presented as chromosomes, as only sequence data were used 
for the assembly (Huang et al. 2013). For guinea fowl (Vignal et al. 2019), the 
assignment to chromosomes is based on alignment to the chicken genome, taking 
into account the known rearrangements observed by cytogenetic analyses. While 
having genome sequence data in a random order just as contigs and scaffolds with no 
chromosomal assignment can be sufficient for studying genes, their numbers, their 
structure or their expression, precise location along the chromosomes will be needed 
for QTL or GWAS analyses and for most genetic applications in breeding.

New sequence technologies produce longer reads, allowing for a much higher 
genome sequence continuity, especially longer contigs that are essential for a correct 
annotation (Yandell and Ence 2012). Complementary techniques now also allow 
assembling the contigs at a chromosome level at a fraction of the cost of former 
physical mapping strategies. Optical mapping such as implemented by Bionano 
Genomics™ allows mapping by direct observation of DNA molecules fluorescently 
labelled at specific short sequence motifs defined by restriction enzyme cutting sites 
(Lam et al. 2012; Dong et al. 2013). The detection of long-range DNA interactions by a
Hi-C (chromosome conformation capture) approach on chromatin reconstituted
in vitro with long DNA fragments and nucleosomes, such as implemented by the
Dovetail Genomics™ Chicago™ method, is another long-range mapping technique
allowing for chromosome-level assemblies (Putnam et al. 2016). Bird species benefit
widely from these technical advances (Korlach et al. 2017), and continuous
improvements of the quality of reference genomes are expected. Bird genome
sequences produced now have a much higher quality than the first-generation 
chicken genome at a fraction of the cost. Results stemming from these possibilities 
are emerging in genomics of wild species, such as exemplified by the detection of 
signatures of selection in genes related to neuronal functions and cognition in the 
great tit genome (Laine et al. 2016), to reproductive strategies in the ruff 
(Lamichhaney et al. 2016) or to migration and climate change in the yellow warbler 
(Bay et al. 2018). 


6.3 Genomics in Ecology and Evolution: With or Without a Model 
Organism 

Ideally, the best model for studying the biology of a given species will be a selection 
of individuals from the species itself, observed in the closest possible conditions to 
the natural ones. This is obviously not possible in practice for many reasons 
including biological complexity, experimental possibilities, right of access to 
biological material or even ethics. Indeed, species used as model organisms were 
chosen because they are easy to breed in experimental settings and because they 
have specific characteristics that allow a range of biological questions to be answered.
Amongst the most prominent species, some were chosen due to their phylogenetic
proximity to humans, such as the mouse; others because they allow work on very
specific questions in basic research, such as the yeast, the fruit fly or the nematode
worm. Species that have been used for a long time now have the added benefit of
accumulated knowledge and material, such as experimental protocols or cell lines 
and more recently high-quality whole genome sequences. A large consortium of 
research groups worked on producing the first-generation chicken genome allowing 
for rapid progress in bird genome biology. With reduced costs and improved 
sequencing quality, many birds are now sequenced or will be in a short time, thus 
challenging the status of chicken as prominent model organism. For instance, in 
embryology, quail is emerging as a new model as it is smaller and has a shorter 
generation time than chicken (Huss and Lansford 2017), but one major limitation for 
genomics in this species was the absence of a reference genome, a problem which
has been addressed with a first-generation assembly now available and
annotated (Morris et al. 2019). In this specific example, both chicken and quail are
used to answer potentially similar questions in embryology, but in other fields of 
biology, different model species will be required, such as the zebra finch for the 
biology of learned vocalisation (Warren et al. 2010). With the advent of cheap 
genome sequencing, access to genomic data from an ever-increasing
number of model organisms will be possible (Kraus and Wink 2015; de Magalhaes 
2015). 

Whole genome sequencing improves the resolution of phylogenetic trees (Jarvis 
et al. 2014), but although in theory all species should be considered equally in such 
analyses, differences in annotation levels complicate the task, and a uniform annotation
had to be made using chicken and zebra finch as references (Jarvis et al. 2014).
Given the difficulties in obtaining species-specific RNA datasets and other annotation
datasets based on biological samples for many bird species, the annotation of a
limited number of model species will remain essential, and one option would be to 
choose such a minimal set of reference species. Moreover, when comparing gene 
counts between bird species and between avian species and other vertebrates, care 
must be taken about genes that will be missing for purely technical reasons, such as 
the ones potentially present on the six microchromosomes still missing in chicken. 
Although they are small and may contain only a limited number of coding genes, 
some selection pressure has acted towards their conservation. Having a real idea of 
the typical gene content in birds may therefore not be trivial, requiring more 
improvements in the chicken assembly, and the fact that the exact number of 
genes is still not exactly known even in the human genome (Pertea et al. 2018) 
highlights the importance of having at least one bird as part of the GRC. 

There have been cases in which the sequencing of a genome has directly allowed
the identification of a genome modification responsible for a phenotypic variation
without using a model organism. For instance, genome sequencing of the ruff
Philomachus pugnax, followed by low-coverage re-sequencing of males having
different genetically defined mating behaviour morphs, allowed the identification
of a 4.4 Mb causative inversion (Lamichhaney et al. 2016). However, would a
subtler mutation like a discrete SNP in a regulatory region have been identified 
without any prior knowledge on epigenetic marks, such as currently sought for in 
chicken by the FAANG consortium? Indeed, quite often, causative genes in birds 
were discovered because they were candidate genes stemming from work in chicken. 
For instance, in quail, the SLC45A2 gene was shown to cause colour variation as in 
chicken (Gunnarsson et al. 2007), the MITF gene to cause the silver phenotype 
(Minvielle et al. 2010) and the MLPH gene the lavender plumage colour (Bed’hom 
et al. 2012). Making the matter even more complex, many traits involved in adaptation
and behaviour that are studied in ecology will be polygenic, subject to environmental
factors or be caused by subtle differences in regulatory regions of genes that
will be difficult to point out. For instance, a SNP in the 3' untranslated region of the 
circadian clock period gene in drosophila affects the thermal plasticity in its midday 
siesta by affecting the thermosensitive splicing of its intron 8 (Cao and Edery 2017). 
This was discovered thanks to drosophila being a well-studied model organism and 
emphasises once again the need for a high-level annotation of regulatory elements in 
several species by ChIP-seq or by other means. As this is only possible for a
handful of species, a phylogenomics approach is expected to benefit a wider range,
especially if any species genome can be used as the reference. How long will it take
to go from lists of genes possibly associated to traits, such as seen in the great tit and 
the yellow warbler (Laine et al. 2016; Bay et al. 2018) to their confirmation and the 
elucidation of the underlying mechanism? And will this be possible without using a 
model organism available in a laboratory setting? For instance, the implication of 
BMP4 in morphological variation in Darwin’s finches was confirmed by laboratory 
experiments on chicken embryos (Abzhanov et al. 2004). The latest gene editing 
technologies now allow a wider range of species to be manipulated, but there again, a 
choice of models will have to be made for practical reasons.


Abstract

An outstanding feature of avian karyotypes is an extraordinary degree of apparent 
similarity from one species to the next, with the majority of avian species 
exhibiting 2n = 74-86. Several exceptions to this rule include avian clades that
have a large degree of chromosomal fusion and fission. In this chapter we 
describe patterns of avian chromosomal evolution, including likely associations 
between karyotype evolution and phenotype. We also describe novel approaches 
that will facilitate avian chromosome studies at molecular level to unravel the 
mystery of the significance of this very distinctive genomic structure.


1 Overall Genome Structure

The overall genome structure of birds has several unique features. First, birds 
possess the smallest genomes among amniotes, averaging 1.35 Gbp (Giga base 
pairs) and ranging from 0.9 Gbp in the black-chinned hummingbird [Archilochus
alexandri; (Gregory et al. 2009)] to 2.1 Gbp in the ostrich [Struthio camelus; (Scanes
2014)]. This reduction in size from that of their forebears has been hypothesised to 
reflect an adaptation to the high metabolic requirements of powered flight (Gregory 
2002; Hughes and Friedman 2008). This is, in part, supported by the fact that 
flightless birds have larger genomes than flying birds and bats’ genomes are smaller 
than their mammalian sister groups (Gregory 2005). This notion has however been 
challenged in that comparative analyses also suggest that the evolution of compact 
genomes may have occurred before the emergence of flight (Tiersch and Wachtel 
1991; Organ et al. 2007). Regardless of when it occurred, the compactness of avian 
genomes in extant species is characterised by a low fraction of repetitive DNA, 
shorter genes and non-coding regions and by the extensive loss of gene family 
members (Zhang et al. 2014). 

Considering each of these in turn, the fraction of repetitive elements in avian 
genomes, including transposable elements (TEs), varies from 4 to 22%. These are 
very low values when compared to the 35-52% found in mammalian genomes 
(Lander et al. 2001; Zhang et al. 2014). Indeed, 47 out of the 48 avian genomes 
analysed by the Avian Phylogenomics Consortium had a fraction of TEs below 10%. 
The exception was the downy woodpecker (Picoides pubescens) with 22% of its
genome comprising TEs resulting from an either species- or lineage-specific expansion
of LINE-CR1 (long interspersed elements; chicken repeat 1) transposons (Zhang
et al. 2014). Interestingly, even though genomes sequenced with short next-generation
sequencing (NGS) technologies are known to have an underrepresentation
of repetitive sequences, the data reported by Farre and collaborators (2016)
demonstrate that budgerigar (Melopsittacus undulatus), common cuckoo (Cuculus
canorus) and downy woodpecker genomes (assembled with NGS) had more
transposable elements detected in their assemblies than chicken (Gallus gallus),
zebra finch (Taeniopygia guttata) and turkey (Meleagris gallopavo) that were
sequenced with traditional sequencing techniques or a combination of NGS and
traditional techniques (Farre et al. 2016). The average total length of SINEs (short
interspersed elements) in birds is also much lower than that of other reptiles with a
total length of ~1.3 Mbp (Mega base pairs) compared to ~12.6 Mbp in the alligator
and ~34.9 Mbp in green sea turtle (Chelonia mydas), likely indicating that the avian
lineage underwent a reduction in the number of SINEs (Zhang et al. 2014). When
considering gene size, avian genomes have shorter genes (50% shorter than mammalian
genes and 27% shorter than other reptilian genes), mainly due to a reduction
of intron sizes and intergenic regions. This compression, shared with bats, may result
from a reduced number of TEs within these regions (Zhang et al. 2013, 2014). 
Moreover, bird genomes show an overall reduction in the number of gene family 
members when compared to other vertebrates, and the loss of these paralogs seems 
to correlate with segmental deletions associated with large-scale structural
rearrangements ancestral to the avian lineage (Hughes and Friedman 2008; Lovell
et al. 2014; Warren et al. 2016). While all the above features characterise avian 
genomes, it is perhaps the organisational structure of the genome, i.e. the karyotype, 
that is the most peculiar. 


1.1 Karyotype Structure 

The avian karyotype is near-unique in nature, comprising more chromosomes than 
any other vertebrate (approx. 2n = 80). Sizes of avian chromosomes vary from
200 Mbp to as low as 3.4 Mbp (Ellegren 2010), roughly divided into ~10 pairs of
macrochromosomes, comparable in size with mammalian chromosomes and comprising
approximately 70% of the avian genome, and around 30 pairs of almost
evenly sized, morphologically indistinguishable microchromosomes (Christidis 
1990; Masabanda et al. 2004; Griffin et al. 2007). The morphological similarity of 
the microchromosomes makes them almost impossible to distinguish by classic 
cytogenetic techniques such as karyotyping, as they appear as tiny “dots” using 
light microscopy. Nonetheless, there have been attempts at generating (albeit partial) 
karyotypes from around 1000 species [most (800+) reported in Christidis (1990)], 
approximately 10% of all birds currently described. 

Size is not the only feature distinguishing macro- and microchromosomes. 
Microchromosomes are GC-rich, gene dense (accounting for only 23% of
the genome but 48% of the genes) and exhibit higher nucleotide substitution rates
and recombination rates than macrochromosomes (Smith et al. 2000; Habermann 
et al. 2001; Burt 2002; Axelsson et al. 2005). The presence of microchromosomes is 
not unique to birds, as they share this genomic feature with turtles and lizards, but 
not with crocodilians, the sister clade of birds (Olmo 2008). Nonetheless, birds have 
smaller microchromosomes in greater numbers than any other vertebrate. Burt 
(2002) proposed that the vertebrate ancestor genome contained microchromosomes,
suggesting that the characteristic avian karyotype was established at an extraordinarily
early stage of evolution, with the absence of microchromosomes in crocodiles
arising by fusion of micro- and macrochromosomes after the crocodilian-bird
lineages diverged (Ellegren 2010); see the later section on ancestral karyotypes.
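
The gene-density contrast quoted above can be made explicit with a short calculation. The sketch below is only illustrative: it uses the two percentages given in the text (23% of the genome, 48% of the genes) and assumes, as a simplification, that all remaining sequence and genes belong to macrochromosomes. The function and variable names are hypothetical.

```python
# Figures quoted in the text: microchromosomes hold ~23% of the genome
# and ~48% of the genes. ASSUMPTION: everything else is assigned to
# macrochromosomes (unplaced sequence and sex chromosomes are ignored).
MICRO_GENOME_FRACTION = 0.23
MICRO_GENE_FRACTION = 0.48


def relative_gene_density(gene_fraction: float, genome_fraction: float) -> float:
    """Gene density relative to the genome-wide average (average = 1.0)."""
    return gene_fraction / genome_fraction


micro_density = relative_gene_density(MICRO_GENE_FRACTION, MICRO_GENOME_FRACTION)
macro_density = relative_gene_density(
    1 - MICRO_GENE_FRACTION, 1 - MICRO_GENOME_FRACTION
)
fold_difference = micro_density / macro_density

print(f"microchromosomes: {micro_density:.2f}x the genome-wide average")
print(f"macrochromosomes: {macro_density:.2f}x the genome-wide average")
print(f"micro vs macro:   {fold_difference:.1f}-fold")
```

Under these assumptions, microchromosomes carry about twice the average gene density and roughly three times that of macrochromosomes.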

The majority of avian chromosomes are acrocentric (centromere located at the
chromosome end with an almost undetectable p-arm) (Christidis 1990; Griffin et al.
2007), and most chicken centromeres contain long [>100 Kbp (Kilo base pairs)]
arrays of chromosome-specific simple repeats. However, the centromeres of chicken
chromosomes 5, 27, and Z are remarkably short (~30 Kbp) and lack the usual repeat
structure (Shang et al. 2010). Moreover, centromere repositioning and the formation
of neocentromeres have been observed (Zlotina et al. 2012). As in other vertebrates,
avian telomeres possess the canonical TTAGGG repeat motif, but they constitute 
large repeat blocks that can be up to 4 Mbp in length (Delany et al. 2007; O’Hare and 
Delany 2009).

Fig. 1 Chromosome complement variation in vertebrates. Chromosome numbers were obtained
from Gregory (2017), Beçak et al. (1971), Benirschke et al. (1973, 1975)


1.2 Diploid Number 

A surprising feature when comparing most avian karyotypes, despite the overall
large diploid number, is an apparent similarity in terms of their sizes and genomic
content between species. More than 60% of all avian karyotypes contain a diploid
number of 74-86 chromosomes (Christidis 1990; Griffin et al. 2007) (Fig. 1), and
chromosomal painting shows that interchromosomal rearrangements are extremely
rare during avian evolution, at least among the larger chromosomes. There are
however exceptions to this rule, with, among others, the stone curlew (Burhinus
oedicnemus) having 2n = 40 and the Southern go-away-bird (Corythaixoides
concolor) having 2n = 142 (Christidis 1990; Griffin et al. 2007) (Table 1). Intriguingly,
deviations from the typical avian chromosome number are limited to few
avian groups, such as penguins, some birds of prey (i.e. falcons and eagles) and
parrots (de Oliveira et al. 2005; Griffin et al. 2007; Nanda et al. 2007; Nishida et al.
2008) (Table 1). It is apparent that a diploid number reduction could be achieved
(such as in falcons and parrots) from microchromosomal tandem fusions and/or the
fusion of micro- and macrochromosomes (de Oliveira et al. 2005; Griffin et al. 2007;
Nanda et al. 2007; Nishida et al. 2008; Nie et al. 2009; Damas et al. 2017). For
example, a detailed analysis of the atypical avian karyotype of the peregrine falcon
(Falco peregrinus) revealed that this species presents chromosomes comprising as
many as four fused ancestral chromosomes each (Damas et al. 2017).

Table 1 Examples of avian karyotypes with extreme deviations from the typical avian 2n ≈ 80

Order            Species                  Common name            Diploid number (2n)  References
Charadriiformes  Burhinus oedicnemus      Stone curlew           40                   Christidis (1990)
Charadriiformes  Esacus magnirostris      Beach thick knee       40                   Christidis (1990)
Falconiformes    Falco columbarius        Merlin                 40                   Nishida et al. (2008)
Falconiformes    Falco peregrinus         Peregrine falcon       50                   Nishida et al. (2008)
Psittaciformes   Melopsittacus undulatus  Budgerigar             58                   Nanda et al. (2007)
Piciformes       Picoides pubescens       Downy woodpecker       92                   Shields et al. (1982)
Sphenisciformes  Pygoscelis adeliae       Adelie penguin         96                   Fedesma et al. (2003)
Coraciiformes    Ramphastos toco          Toco toucan            106                  Venturini et al. (1986)
Coraciiformes    Alcedo atthis            River kingfisher       138                  De Smet (1981)
Musophagiformes  Corythaixoides concolor  Southern go-away-bird  142                  Christidis (1990)
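
As a small illustration of how the diploid numbers in Table 1 relate to the typical avian range, the sketch below classifies each listed species against the 2n = 74-86 interval given in the text. The diploid numbers are those of Table 1; the function name and the labels "reduced", "typical" and "increased" are hypothetical choices made for this example.

```python
# Typical avian diploid range quoted in the text (2n = 74-86).
TYPICAL_RANGE = (74, 86)

# Diploid numbers (2n) as listed in Table 1.
TABLE_1 = {
    "Burhinus oedicnemus": 40,
    "Esacus magnirostris": 40,
    "Falco columbarius": 40,
    "Falco peregrinus": 50,
    "Melopsittacus undulatus": 58,
    "Picoides pubescens": 92,
    "Pygoscelis adeliae": 96,
    "Ramphastos toco": 106,
    "Alcedo atthis": 138,
    "Corythaixoides concolor": 142,
}


def classify_diploid_number(two_n: int, typical: tuple = TYPICAL_RANGE) -> str:
    """Label a diploid number relative to the typical avian range."""
    low, high = typical
    if two_n < low:
        return "reduced"      # consistent with chromosome fusions
    if two_n > high:
        return "increased"    # consistent with chromosome fissions
    return "typical"


classification = {
    species: classify_diploid_number(two_n) for species, two_n in TABLE_1.items()
}

for species, two_n in sorted(TABLE_1.items(), key=lambda item: item[1]):
    print(f"{species:<26} 2n = {two_n:<4} {classification[species]}")
```

All ten species fall outside the typical range, as expected for a table of extreme cases: five have reduced and five have increased diploid numbers.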

This similarity in karyotypes of multiple species across the phylogenetic Class 
distinguishes birds from other clades, e.g. mammals and lizards, where interchromosomal
rearrangements are relatively common (Organ et al. 2008; Graphodatsky et al.
2011; Carvalho et al. 2015) and chromosome numbers are more variable (Fig. 1). For
instance, in mammals, chromosome number can vary enormously between species
of the same genus. An extreme example of this can be seen in the muntjacs where the
Chinese muntjac (Muntiacus reevesi) has a diploid number of 46, while the Indian
muntjac (Muntiacus muntjak) has a diploid number of 6 (7 in males) (Wurster and
Benirschke 1970), illustrating that in closely related mammals (~5 MY divergence 
time), karyotypic variation can be very high. 


1.3 Sex Chromosomes 

Opposite to the situation in mammals, in birds, females are the heterogametic sex, 
and males are homogametic. Females thus have one copy of the Z chromosome and 
one chromosome W that mostly do not undergo genetic recombination, and males 
have two copies of chromosome Z in each cell (Graves 2014). This is just one 
example of sex chromosome heterogamety, which evolved independently many
times as it can also be found in butterflies, fishes, non-avian reptiles, and amphibians
(Matsubara et al. 2006). 

In Neognathae, the Z and W chromosomes are differentiated in terms of size with 
the chromosome W being largely heterochromatic, gene poor and significantly 
smaller than the Z. Ratites, however, have a W chromosome which is of a similar
size to the Z and is homologous in its entirety, except, in the case of emus, for a small
region near the centromere (Shetty et al. 1999). It can thus be inferred that the ZW
system was evident prior to the divergence of the Palaeognathae and Neognathae 
lineages (Deakin and Ezaz 2014). Interestingly, Rutkowska and collaborators (2012) 
demonstrated that unlike the mammalian Y chromosome, the avian W chromosome 
is not subjected to gradual length reduction over evolutionary time. Indeed, closely 
related avian species could have W chromosomes that differ significantly in size due 
to variation in the length of non-coding regions (Rutkowska et al. 2012). In chickens, 
the Z chromosome is 80 Mbp in size and contains around 1000 genes making it far 
less gene dense than the autosomes. It also has 60% more interspersed repeats than 
the autosomes making it different from the rest of the chicken genome (Bellott et al. 
2010) but interestingly, similar (in terms of relative gene and repeat content) to the 
mammalian chromosome X (Graves 2014). Although morphologically resembling 
the XY system seen in mammals (with the exception of the ratites of which the Z and 
W are indistinguishable from each other cytogenetically), the two systems exhibit no 
homology (Nanda et al. 1999). In fact, the avian Z chromosome shares homology 
with human autosomes 5, 9, and 18 (Bellott et al. 2010), and the human X 
chromosome shares homology with a block of the q-arm (long arm) of chicken 
chromosome (GGA) 1 and 20 Mbp of the p-arm (short arm) of chicken chromosome 
4 (an independent microchromosome in most birds) (Ross et al. 2005). There also 
appears to be no homology between the majority of non-avian reptile ZW and avian 
ZW systems. The Japanese four-striped rat snake (Elaphe quadrivirgata), for example,
has a ZW system that shares homology with GGA2 not Z or W (Matsubara et al.
2006). The exception to this rule appears to be the gecko lizard system (Gekko 
hokouensis) which exhibits homology with the chicken Z chromosome (Kawai et al. 
2009). The ZW system is not a feature that is present among all reptiles. Both XY 
and ZW sex chromosome systems are observed in lizards and turtles, with lizards 
showing the largest degree of variability among reptiles (Ezaz et al. 2017). Some 
reptiles such as crocodiles also exhibit temperature-dependent sex determination 
(TSD) with the Gekkonidae lizard family being particularly unusual in exhibiting all 
three types of sex determination (O’Meally et al. 2012). 

Unlike in mammals, in chicken at least the sex-determining gene is not SRY 
(sex-determining region Y). In fact, the homologous gene of SOX3 (SRY-like high 
mobility group box-containing gene 3), on the mammalian chromosome X that 
evolved into SRY, lies on chicken chromosome 4 among other genes that are 
X-borne in mammals (Graves 2013, 2014). It has been suggested that the gene
DMRT1 (doublesex and mab-3 related transcription factor 1) found on the chicken Z 
chromosome may be the key to sex determination (Smith et al. 2009). The system is 
dosage-dependent meaning that male determination requires two copies of the gene. 
DMRT1 has also been shown to be required for testis formation (Smith et al. 2009).
There is, however, still much debate as to what determines sex in birds. Possible
candidates also include W-specific genes that may determine ovarian function 
(Graves 2014), among other theories. Improvements in the assembly of the Z 
chromosome achieved using a bacterial artificial chromosome (BAC by BAC) 
approach (Bellott et al. 2010) along with work underway to improve the assembly 
of the W chromosome (Chen et al. 2012; Smeds et al. 2015; Tomaszkiewicz et al. 
2017) will assist with resolving these questions. 


2 An Absence of Inter-macrochromosomal Rearrangement 

The generation of chromosome-specific DNA probes (paints) for chicken 
macrochromosomes (chromosomes 1-9, Z, and W) not only facilitated the 
characterisation of the chicken karyotype (Griffin et al. 1999; Habermann et al. 
2001; Masabanda et al. 2004) but also led to a surge in avian comparative genomics 
research [e.g. Nanda et al. (1999), Shetty et al. (1999), Raudsepp et al. (2002), 
Shibusawa et al. (2002), Itoh and Arnold (2005), Griffin et al. (2007), Nishida et al. 
(2008), Nanda et al. (2011), Kim et al. (2013)]. Avian cross-species chromosome 
painting showed a high degree of synteny, even between distantly related avian 
lineages such as chicken, emu (Dromaius novaehollandiae), and zebra finch that
diverged ~90 MYA (Shetty et al. 1999; Guttenbach et al. 2003; Derjusheva et al.
2004; Romanov et al. 2014). This conservation is also noticeable with non-avian
reptiles such as crocodiles and turtles, which diverged from a common ancestor with
birds ~230 MYA (Matsuda et al. 2005; Kasai et al. 2012; Pokorna et al. 2012).

The use of microchromosomal paints, however, has been relatively limited 
(Griffin et al. 1999; Shetty et al. 1999; Hansmann et al. 2009; Nie et al. 2009) 
largely due to the fact that paints are represented by “pools” of microchromosomes, 
i.e. probes that recognise more than one chromosome pair rather than being assigned 
to separate, entire chromosomes. Chromosome paints are initially generated by flow 
cytometry of a suspension of chromosomes, and the inability to resolve smaller 
chromosomes by this approach means that multiple chromosome pairs are isolated in 
the flow karyotype (Lithgow et al. 2014). Still, the limited number of studies using 
chicken microchromosome paints on other bird karyotypes showed that, in most 
cases, synteny is also conserved among these chromosomes, with most chicken 
microchromosomes represented as a single chromosome in other species (Deakin 
and Ezaz 2014). Exceptions to this rule are found for the species with an atypical 
chromosome number (e.g., falcons and parrots), where, as mentioned above, the 
reduction in the number of chromosomes resulted from microchromosome tandem 
fusions or the fusion of micro- and macrochromosomes (de Oliveira et al. 2005; 
Griffin et al. 2007; Nanda et al. 2007; Nishida et al. 2008; Nie et al. 2009; Damas 
et al. 2017). 

Molecular cytogenetic techniques give valuable insights into the detection of 
interchromosomal rearrangement (or lack of it) in avian chromosomes; however, 
their relatively low resolution means that they are limited in their usefulness in 
assessing if this stability extends to intrachromosomal rearrangement. Indeed, the
comparison of high-resolution physical maps and genome sequences has begun to
establish that unlike interchromosomal events, intrachromosomal ones 
(e.g. chromosomal inversions) were relatively frequent during avian evolution 
(Volker et al. 2010; Skinner and Griffin 2012; Zhang et al. 2014) (Fig. 2). These 
studies clearly demonstrated the importance of the availability of assembly data for 
the detection of the full collection of events that shaped genomes through evolution. 
The recent comparison of 21 sequenced avian genomes revealed that during evolution,
bird chromosomes had an average intrachromosomal rearrangement rate of
~1.25 evolutionary breakpoint regions per MY (EBRs/MY) (Zhang et al. 2014),
which is higher than the ~0.35 EBRs/MY reported for mammals (Farre et al. 2011).
These evolutionary rates, coupled with the smaller genome sizes in birds compared
to mammals, show that despite their karyotypic stability, avian genomes have a
higher density of genome rearrangements than, for instance, mammals and suggest
that intrachromosomal rearrangements might be significant contributors to the phenotypic
diversity presented by the members of this Class. Furthermore, rearrangement
rates are highly variable between avian lineages. For instance, the origin of
Neognathae was accompanied by an elevated rate of chromosomal rearrangements,
~2.87 EBRs/MY (Zhang et al. 2014). Interestingly, vocal learning species [i.e. zebra
finch, medium ground finch (Geospiza fortis), American crow (Corvus
brachyrhynchos), budgerigar, Anna’s hummingbird (Calypte anna)] show higher
rearrangement rates than their non-vocal learning relatives [i.e. golden-collared
manakin (Manacus vitellinus), peregrine falcon, chimney swift (Chaetura pelagica);
P = 0.0499] and even higher than all the other non-vocal learning species
(P = 0.004), which might relate to the larger radiations these clades experienced
relative to the other bird groups (Zhang et al. 2014). Nonetheless, due to the dearth of 
chromosome-level assemblies for most of the studied species, these rearrangement 
rates could still be underestimated, as evolutionary breakpoint regions (EBRs) 
located between scaffolds would not be included in the analysis, or be inaccurate 
as assembly errors could be misinterpreted as EBRs. 
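
The density argument made above (a higher rearrangement rate in a smaller genome) can be illustrated with a back-of-the-envelope calculation. The sketch below uses the two rates quoted in the text and the average avian genome size given earlier in this chapter (1.35 Gbp). The mammalian genome size of 3 Gbp is an assumption introduced here for illustration only and is not taken from the text; the resulting ratio changes with that choice.

```python
# Rates quoted in the text (evolutionary breakpoint regions per million years).
BIRD_EBRS_PER_MY = 1.25
MAMMAL_EBRS_PER_MY = 0.35

# Genome sizes in Gbp. The avian average is the figure given in this chapter;
# the mammalian value is an ASSUMED round number, not taken from the text.
BIRD_GENOME_GBP = 1.35
MAMMAL_GENOME_GBP_ASSUMED = 3.0


def rearrangement_density(ebrs_per_my: float, genome_gbp: float) -> float:
    """EBRs per million years per Gbp of genome."""
    return ebrs_per_my / genome_gbp


bird_density = rearrangement_density(BIRD_EBRS_PER_MY, BIRD_GENOME_GBP)
mammal_density = rearrangement_density(
    MAMMAL_EBRS_PER_MY, MAMMAL_GENOME_GBP_ASSUMED
)
rate_ratio = BIRD_EBRS_PER_MY / MAMMAL_EBRS_PER_MY
density_ratio = bird_density / mammal_density

print(f"birds:         {bird_density:.2f} EBRs/MY/Gbp")
print(f"mammals:       {mammal_density:.2f} EBRs/MY/Gbp")
print(f"rate ratio:    {rate_ratio:.1f}x")
print(f"density ratio: {density_ratio:.1f}x")
```

With these inputs the difference in rate alone is about 3.6-fold, and normalising by genome size widens the gap further. Both figures inherit the caveats on EBR detection noted above.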


3 Genome Stability Models 

Karyotype differences between species arise from DNA aberrations in germ cells 
that were fixed during evolution. As with any other mutation, differences in chromosomal
rearrangement rates between lineages can result from either changes in
their rate of mutation or their rate of fixation (Burt 2001). The level of evolutionary
chromosome rearrangement observed in extant species is an outcome of “arm-wrestling”
between these two rates. The study of factors leading to changes in
mutation and fixation rates therefore might help us understand the variability of 
rearrangement rates between avian clades and why most avian genomes do not 
appear to undergo a great deal of interchromosomal rearrangement when compared 
to, for example, mammals and non-avian reptiles but nonetheless undergo significant 
intrachromosomal rearrangement (Griffin et al. 2007; Skinner and Griffin 2012). 

Fig. 2 Chicken chromosome 1 and 17 representations on the Evolution Highway comparative
chromosome browser (http://eh-demo.ncsa.uiuc.edu/birds/). Blue and pink blocks define
homologous synteny blocks, in “+” and “−” orientation, respectively, compared to the reference chicken
chromosomes for six avian genomes assembled to chromosome levels. Numbers within blocks
depict corresponding chromosome numbers in each of the species


3.1 Generation Time

Generation time and the repetitive DNA content of a species genome are believed to 
be significant contributors to mutation rate variability. The shorter generation times
of birds, relative to mammals, for instance, could imply a higher propensity for
genome rearrangements as a result of a higher number of meioses
and associated crossovers (Brown et al. 2018). As crossovers require the
production of double-strand breaks (DSBs) that could be misrepaired, chromosomes 
would then have a higher chance to rearrange. In fact, generation time was
previously correlated with the differences in rearrangement rates between human and
mouse (Burt et al. 1999), where the short-lived mice present rearrangement rates at 
least twice as high as that of humans. The same reasoning could also explain, at least 
partially, why the lineages leading to the short-generation-time songbirds (Nam et al. 
2010) demonstrate some of the highest rearrangement rates on the avian
phylogenetic tree. Interestingly, there are some limited data showing that the recombination
rates observed in birds [1.7-2.6 cM/Mbp; (Pigozzi 2016)] are higher than those of 
eutherian mammals [0.2-1.8 cM/Mbp; (Segura et al. 2013)], which could also imply 
higher mutation rates in avian genomes. Nonetheless, more mammalian and avian 
genomes need to be included in the analysis, and a direct comparison of homologous 
sites for mutation rates would need to be performed to test this hypothesis. 


3.2 Repetitive Sequences 

Repetitive DNA sequences [e.g. segmental duplications (SDs), transposable 
elements (TEs) and tandem repeats (TRs)] are often used as a template for 
non-allelic homologous recombination (NAHR) and because of that are considered 
important contributors to genome evolution in eukaryotes (Gregory 2005; Wessler 
2006; Lynch 2007). In this respect, the low repetitive content of avian genomes 
could represent fewer opportunities for avian genomes to change due to a shortage of 
templates for NAHR (Burt 2002; Ellegren 2010) and would have an opposite effect 
on the rate of rearrangements compared to generation time presented in the previous 
section. Indeed, EBRs are associated with repetitive sequences in many animal 
groups. Lineage-specific EBRs were previously found enriched for SDs, TRs, and 
long terminal repeats (LTRs) in mammals (Murphy et al. 2005; Farre et al. 2011; 
Groenen et al. 2012), yeasts (Chan and Kolodner 2011), and Drosophila (Puerma 
et al. 2016), among others. In birds, lineage-specific EBRs are usually enriched in 
LTRs (Skinner and Griffin 2012; Farre et al. 2016), and the genomes of songbirds 
show both an expansion of LTR elements and higher rearrangement rates than other 
avian clades (Zhang et al. 2014; Kapusta and Suh 2017). 

The repeat-poor nature of avian genomes was previously hypothesised to
contribute to the maintenance of independent chromosomes after breakage because of a
lack of non-allelic homologous sequences to remerge broken DNA fragments into 
new chromosomes (Burt 2002). This hypothesis could explain the maintenance of 
many small chromosomes in the avian lineage and was used to explain the larger 
diploid number of bird karyotypes, when compared to non-avian reptiles (Burt
2002). Nonetheless, many avian species with highly rearranged karyotypes show a 
predominance of chromosomal fusions, as shown, for instance, for the peregrine 
falcon (Damas et al. 2017), parrots (Nanda et al. 2007) and eagles (de Oliveira et al. 
2005). The higher repeat content of, for instance, mammalian and amphibian 
genomes, when compared to birds, aligns with a much higher frequency of
chromosomal fusion, endorsing the role of repetitive sequences for chromosome merging
(Voss et al. 2011; Uno et al. 2012). However, the same pattern is not clearly 
observed within birds where higher levels of rearrangements in the genomes of 
falcons, penguins, and parrots do not seem to correlate with higher repetitive 
contents (Damas et al. 2017). On one hand, this lack of correlation between the 
number of repetitive sequences and the number of EBRs could be concealed by an 
underrepresentation of repetitive elements in NGS genome sequences. On the other 
hand, it is also possible that as was observed in the highly rearranged gibbon genome 
(Carbone et al. 2014), the formation of chromosomal rearrangements could be 
associated with segmental duplications or unidentified clade-specific TE families 
or even result from repair mechanisms that do not require homologous templates, 
such as non-homologous end joining (Moore and Haber 1996). 


3.3 Functional Constraints 

Independently of the rate of mutation observed in a lineage, after a chromosomal 
rearrangement occurs, it will only influence the evolution of species if it is fixed or 
nearly fixed (polymorphism). The probability of fixation of a chromosome
rearrangement can increase (a) simply by chance (i.e. genetic drift) in small or inbred
populations, (b) if one of the chromosomal variants is transmitted at a higher rate to
the descendants (i.e. meiotic drive), or (c) if the novel chromosome variant has 
selectively advantageous implications (Burt 2001). Nonetheless, in large, randomly 
mating populations, chromosome rearrangements have a higher probability of being 
fixed if they have neutral, nearly neutral, or selectively advantageous implications 
(Burt 2001). This way, the relative compactness of avian genomes might be one of 
the main contributors to their stability, by reducing the chances of fixation of 
chromosome rearrangements. 

As mentioned in the opening section, genome compactness was previously 
hypothesised to relate to the high metabolic demands of powered flight (Gregory 
2002; Hughes and Friedman 2008). Indeed, this theory is supported both by the 
smaller genome sizes of bats, when compared to other mammals (Gregory 2005), 
and the genome size variation within the Class Aves. Flightless birds [e.g. ostrich
(Struthio camelus)] have the largest avian genomes, and hummingbirds, which have the
highest metabolic rates among birds, have the smallest (Wright et al. 2014).
Nonetheless, having compact genomes with higher gene and regulatory element density,
together with a lower number of gene family members, can also have some
drawbacks. One of them would be a lower tolerance to rearrangement due to a
higher probability of a rearrangement having significant biological implications
that would not be tolerated by selection. The fact that avian genomic regions
that have large numbers of collinear conserved DNA sequences [multispecies
(ms) homologous synteny blocks (HSBs)] were found significantly enriched for
conserved non-coding elements [CNEs; (Farre et al. 2016)] supports this hypothesis.

CNEs are non-coding sequences evolving more slowly than the neutral substitution rate,
many of which are known to be regulatory elements (e.g. sites for transcription
regulatory factors) (Woolfe et al. 2004; Babarinde and Saitou 2016). Thus, the
disruption of CNE-dense regions could have a substantial impact on gene regulation
pathways. As birds have both a higher fraction of their genomes within CNEs [~7%;
(Zhang et al. 2014)] compared to mammals [~4%; (Lindblad-Toh et al. 2011)] and
smaller genome sizes, genome rearrangements might have a higher likelihood of 
causing significant (and/or deleterious) functional implications, which will affect 
their chances of fixation. In support of this hypothesis, genomic regions harbouring 
lineage-specific EBRs demonstrate a lower CNE density than their immediately 
adjacent regions (Damas et al. 2017). Moreover, EBRs flanking intrachromosomal 
rearrangements have a higher fraction of CNEs than those EBRs that flank
chromosomal fusions and fissions (Damas et al. 2017). This observation agrees with the idea
that interchromosomal rearrangements might have more drastic effects on cis gene 
regulation (Skinner and Griffin 2012; Romanov et al. 2014), and because of that, 
their fixation would be further restricted. Indeed, interchromosomal changes were 
rarely fixed in avian genomes and are limited to few avian lineages and very few 
cases in most genomes (Griffin et al. 2007). In contrast, the fact that larger genomes, 
with longer non-coding regions, lower gene and CNE density, and higher repetitive 
content (e.g. those of mammals and non-avian reptiles) tend to have higher rates of 
interchromosomal rearrangement further supports this hypothesis. It is also
noteworthy that in avian genomes when interchromosomal change occurs, it tends to recur at
the same site. For instance, the fusion of ancestral chromosomes 4 and 10 appears to
have occurred independently during the evolution of chicken, greylag goose (Anser
anser), collared dove (Streptopelia decaocto) and other species (Fig. 3) (Griffin et al.
2007), suggesting either that there are a limited number of places in the avian genome
where ancestral chromosomes can merge without any deleterious effects for the
carrier of the fused chromosome or that homologous sequence templates are present
which make fusion of ancestral chromosomes 4 and 10 likely to recur.


4 Ancestral Karyotypes 

The first reconstructions of ancestral genome structures were based on 
low-resolution karyotype comparisons using chromosome painting. These studies 
offered the first insights into the genome rearrangements that shaped extant genome 
structures. As mentioned previously, avian cross-species chromosome painting 
showed a high degree of synteny both between avian lineages (Shetty et al. 1999; 
Guttenbach et al. 2003; Derjusheva et al. 2004; Romanov et al. 2014) and between 
birds and non-avian reptiles (Kasai et al. 2012; Pokorná et al. 2012), making it
relatively easy to predict some key features of the ancestral avian karyotype using
cytogenetic data.

Fig. 3 Seven-colour chromosome painting illustrating syntenies between chicken (Gallus gallus) and
goose (Anser anser). Like chicken, the goose also has a “fused” chromosome 4. Colour code:
chromosome 1 in cyan, 2 in yellow, 3 in blue, 4 in green, 5 in magenta, Z in orange and W in white

Using this approach, Griffin and collaborators (2007) reconstructed
an avian ancestral karyotype representing chicken chromosomes 1-9 plus Z (Griffin 
et al. 2007). Only one difference between the avian ancestor and the chicken 
karyotype was detected, where chicken chromosome 4 results from the fusion of 
avian ancestor chromosomes 4 and 10 (Griffin et al. 2007). This karyotypic 
organisation is maintained in most avian lineages studied so far. The karyotypes of 
Passeriformes also only show one difference to the proposed avian ancestral
karyotype, where the ancestral chromosome 1 is split into two independent chromosomes.
Galliformes (e.g. chicken, quails, pheasants, and turkey) also underwent only a few 
fusions or fissions, and most ancestral chromosomes were maintained intact (Griffin 
et al. 2007). The exceptions to this pattern are the birds of prey, where multiple 
rearrangements were identified (de Oliveira et al. 2005). 

As reviewed above, the low resolution of cytogenetic methodologies only permits 
the detection of interchromosomal rearrangements, making the FISH-based
comparative maps of limited use for reconstructing karyotypes of older ancestors
(e.g. eutherian and amniote ancestors). Sequenced genome comparisons can,
however, not only expand the evolutionary depth of ancestral karyotype reconstructions
but also facilitate the detection of the full range of rearrangements that shaped 
species genomes through evolution. Nonetheless, most algorithms developed to 
perform the reconstruction of ancestral karyotypes based on genome sequence data 
[e.g. InferCARs (Ma et al. 2006) or ANGES (Jones et al. 2012)] require 
chromosome-level genome assemblies as inputs, which together with the shortage 
of chromosome-level genome assemblies available for birds restricted the study of 
chromosome evolution in this clade. Indeed, only recently the first sequence-based 
avian ancestor genome structure was proposed (Romanov et al. 2014). Romanov and 
colleagues used information from the alignments of chicken, turkey, duck (Anas
platyrhynchos), zebra finch, ostrich, and budgerigar genomes to reconstruct
chromosome structure for each macrochromosome and several microchromosomes
(Romanov et al. 2014). They noted that the inter- and intrachromosomal changes 
leading to the genome organisation of the six extant species’ genomes would be best 
explained by a series of inversions and translocations with common breakpoint reuse 
(Romanov et al. 2014). They also observed that chicken had the lowest number of 
chromosomal rearrangements compared to the avian ancestor and budgerigar and 
zebra finch had the highest. Moreover, microchromosomes seemed to represent 
conserved blocks of synteny (Romanov et al. 2014). These results also suggest 
that there are mechanisms in action to maintain the stability of the avian karyotype. 
In later sections, we will learn more about the history of avian and ancestral
karyotypes when novel algorithms that are designed to work with fragmented
genomes [e.g. DESCHRAMBLER (Kim et al. 2017)] are applied to the currently
available genomes. We will also learn how new cost-effective sequencing and
scaffolding technologies (see below) will allow the generation of scaffolds or contigs of
chromosome (or chromosome arm) length (Korlach et al. 2017) and how these
assemblies are compared to infer evolutionary histories of individual avian
chromosomes.


5 Implications of Chromosome Rearrangements 

Chromosome rearrangements are known to play a role in the generation of genetic 
and phenotypic diversity (Chen et al. 2013). Genomic rearrangements can be 
associated with altered gene expression levels and profiles (Harewood and Fraser 
2014). Balanced rearrangements (i.e. inversions and reciprocal translocations) result
in alterations of nucleotide order without gain or loss of genetic material and might
affect gene expression by moving genes into a new regulatory environment or switch
off genes completely if their coding sequence is disrupted by a breakpoint. In
contrast, unbalanced rearrangements (i.e. duplications, deletions and unbalanced
translocations) can affect gene dosage through the gain or loss of genetic material.
Chromosomal rearrangements have already been demonstrated to affect phenotypes
in multiple species. For instance, yeasts grown in stress-inducing environments, such
as a glucose-limited setting, show many chromosomal rearrangements after a few
generations, and strains containing chromosomal rearrangements are more resilient
to starvation (Dunham et al. 2002; Coyle and Kroll 2008). Pig (Sus scrofa)
lineage-specific EBRs were found enriched for genes related to the sensory perception of
taste, which might explain why pigs can eat food that is unpalatable for humans
(Groenen et al. 2012). In addition, rhesus macaque (Macaca mulatta) EBRs are
enriched for genes related to immune response (Ullastres et al. 2014), which were 
proposed to be involved in lineage-specific adaptation. Birds also show similar 
patterns. For instance, Farre and collaborators (2016) found that budgerigar 
lineage-specific EBRs are enriched in genes involved in forebrain development, 
which might relate to the different organisations of the “vocal brain nuclei” in this
species when compared to other vocal learning birds (Farre et al. 2016). They also
found that Anna’s hummingbird lineage-specific EBRs were enriched for genes in
the hexose metabolic process that might relate to this species’ ability to digest sugar
(Farre et al. 2016). Moreover, there are several examples of association between
phenotypes and rearrangement polymorphisms in birds. For instance, the lavender
plumage colour in Japanese quail (Coturnix japonica) was associated with a
complex mutation in the region of the melanophilin (MLPH) gene which involved two
inversions and one deletion affecting the order of the MLPH and three neighbouring 
genes (Bed’hom et al. 2012). Additionally, the behavioural morphs in ruffs 
(Philomachus pugnax) are probably caused by a balanced inversion with one 
breakpoint in the CENP-N gene that encodes centromere protein N (Kupper et al. 
2016). 

Chromosome rearrangements are also believed to accelerate genic differentiation 
between populations and, therefore, facilitate speciation (Ayala and Coluzzi 2005). 
Indeed, some avian species were found to possess segregating inversion 
polymorphisms (Itoh et al. 2011). These events can lead to the restriction of 
recombination, building genetic incompatibilities that might result in speciation. 
One such example is a large pericentric inversion (~100 Mbp) found in
white-throated sparrow (Zonotrichia albicollis) chromosome 2, which is associated with
behavioural and plumage variations and believed to be an example of the early 
stages of sex chromosome evolution (Thomas et al. 2008; Davis et al. 2011; Zinzow- 
Kramer et al. 2015). 


6 New Tools for the Study of Chromosomal Evolution 

While cytogenetic techniques provide valuable insights into the gross evolutionary 
structure of chromosomes, their relatively low resolution and inability to efficiently
detect intrachromosomal evolutionary changes pose significant limitations to
the study of avian chromosome evolution. Indeed, as referred to previously,
intrachromosomal rearrangements (e.g. chromosomal inversions) were much more
frequent in avian evolution than interchromosomal changes, suggesting that
complete chromosome assemblies (i.e. where each chromosome is represented by one
single contiguous sequence) are urgently needed to reveal missing patterns of avian
genome evolution (Volker et al. 2010; Skinner and Griffin 2012; Zhang et al. 2014; 
Damas et al. 2017) and map economically important phenotypes in poultry species 
(Tuiskula-Haavisto et al. 2002; Richards 2003; Dodgson et al. 2011). 

The generation of chromosome (or near chromosome)-level genome assemblies 
from next-generation sequencing (NGS) data has recently been facilitated by 
the introduction of novel technologies. Explanation of the specifics of each of
these technologies in detail is beyond the scope of this chapter. Suffice it to say,
however, that they are all designed towards achieving the goal of chromosome-level
assemblies. These include long-read sequencing, namely single molecule real-time (SMRT)
sequencing commercialised by Pacific Biosciences (Eid et al. 2009) and nanopore
sequencing commercialised by Oxford Nanopore (Clarke et al. 2009), and the highly
multiplexed short-read sequencing used by 10X Genomics (Zheng et al. 2016). Contigs or
scaffolds can further be joined into long (super)scaffolds using information on native
chromatin conformation [Hi-C; (Lieberman-Aiden et al. 2009)], reconstituted chromatin
conformation [Dovetail
“Chicago”; (Putnam et al. 2016)] or optical mapping [e.g. BioNano; (Mak et al. 
2016)] technologies. While the long-read sequencing technologies theoretically 
should provide one contig (non-gapped DNA sequence) per chromosome, the 
quality and quantity of available DNA as well as the presence of long repeats 
(e.g. centromeres) often limit the length of non-gapped sequences. Moreover, 
novel scaffolding techniques usually show significant levels of disagreement 
between them. Additionally, bioinformatic approaches have been developed to 
significantly improve the contiguity of already existing fragmented NGS assemblies, 
e.g. the Reference-Assisted Chromosome Assembly (RACA; Kim et al. 2013) and 
Ragout (Kolmogorov et al. 2014, 2016). Both algorithms efficiently use comparative 
information (i.e. alignment to phylogenetically related and distant species’ genomes) 
to estimate the order of scaffolds (gapped DNA sequences) in a de novo sequenced 
genome that lacks independent physical maps. Due to limitations of novel
sequencing and mapping techniques, the real promise for making accurate and complete
chromosome-level genome assemblies nowadays lies in integrative approaches that 
combine several complementary techniques in a single genome assembly project. 
Among birds, this has been achieved for the ostrich genome which was assembled 
by combining Illumina sequencing and optical mapping (Zhang et al. 2015). This 
resulted in an assembly with N50 ≈ 18 Mbp with 90% of the genome included in
75 superscaffolds (contiguous lengths of DNA sequence), which corresponds to
approximately 3 superscaffolds per ostrich chromosome. Despite these outstanding
results, these combined approaches still cannot guarantee that each chromosome
is assembled into a single scaffold, and they are associated with
significant expenses that are not always feasible for sequencing projects. A different
approach proposed by Damas and co-workers (2017) involved the combination of 
a bioinformatic tool (RACA), wet lab verification by polymerase chain reaction 
(PCR) and fluorescence in situ hybridisation (FISH) to generate, verify and
physically assign predicted superscaffolds to chromosomes. An important advantage of
this approach was in the design and use of a panel of universal bacterial artificial
chromosome (BAC) probes that can be utilised with high efficiency (>90%) to map
superscaffolds to any avian genome, provided metaphase cell preparations are
available (Fig. 4). Making metaphases can, however, be technically challenging:
an actively growing population of cells is an essential prerequisite, and
well-developed skills in cytogenetics are necessary. Nonetheless, this is a cost-efficient way to map
superscaffolds to chromosomes guaranteeing chromosome-level assembly for 
macrochromosomes and most microchromosomes. Costs associated with this 
method are almost completely limited to FISH mapping of 150-200 universal 
BAC probes to metaphase chromosomes. Damas and co-workers made use of this 
approach to order and orient scaffolds along the rock pigeon (Columba livia) and
peregrine falcon chromosomes (Fig. 4), generating assemblies that are comparable, 
in continuity and reliability, to those achieved using traditional mapping techniques 
[e.g. with the assistance of radiation hybrid maps built based on the presence of 
closely located DNA sequences in the same DNA fragment after breaking DNA with 
irradiation (Baxevanis and Ouellette 2004)].

Fig. 4 Chromosome-level assembly of the peregrine falcon chromosome 5. FISH mapping of
universal BAC clones (e.g. CH261-50C15, CH261-72A10 and CH261-4M5) (a) allows the construction of a
sparse cytogenetic map (b) comparatively anchored to long superscaffolds/predicted chromosome
fragments (PCFs) with the universal BAC clone sequence alignments resulting in ordering and
orienting of superscaffolds (and scaffolds) along the chromosome (c). Alignment of assembled
chromosome to complete chicken and zebra finch genomes (c) allows for the final verification of the
peregrine chromosome structure using independent zoo-FISH painting with chicken
chromosome-specific probes (d)

In the near future (funding permitting),
avian chromosome-level assemblies will be achieved through the generation of long
contigs with long-read technologies (e.g. using Oxford Nanopore and Pacific
Biosciences SMRT technologies), followed by further contig scaffolding using at
least one mapping technique (e.g. Hi-C and/or optical mapping) and the assignment
(and verification) of the scaffolds to chromosomes with the set of universal BAC
clones using FISH. Variations of this approach (e.g. use of RACA instead of more
expensive scaffolding techniques) will help to rapidly generate chromosome-level
assemblies.
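
The contiguity statistics quoted in this section for the ostrich assembly (an N50 value, and the number of superscaffolds that hold 90% of the genome) can be made concrete with a short calculation. The following Python sketch is illustrative only: the function name nx_lx and the toy scaffold lengths are assumptions introduced here and are not part of any of the cited pipelines. It follows the common convention that Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least a fraction x of the total assembly length, and Lx is the number of sequences in that set.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Nx: length of the shortest scaffold in the smallest set of longest
        scaffolds that covers at least `fraction` of the assembly.
    Lx: number of scaffolds in that set.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(lengths, reverse=True)  # longest scaffolds first
    threshold = fraction * sum(ordered)
    covered = 0
    for count, length in enumerate(ordered, start=1):
        covered += length
        if covered >= threshold:
            return length, count
    # Guard against floating-point rounding when fraction is close to 1.
    return ordered[-1], len(ordered)


# Hypothetical scaffold lengths in Mbp (not taken from any real assembly).
toy_scaffolds = [40, 25, 18, 10, 5, 2]

n50, l50 = nx_lx(toy_scaffolds, 0.5)
n90, l90 = nx_lx(toy_scaffolds, 0.9)
print(f"N50 = {n50} Mbp in {l50} scaffolds")
print(f"N90 = {n90} Mbp in {l90} scaffolds")
```

In these terms, the ostrich figures given above correspond to an N50 of about 18 Mbp and an L90 of 75 superscaffolds.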


7 Concluding Remarks

The preponderance of a similar genome organisation (karyotype) in the vast majority 
of birds suggests an associated functional significance, perhaps more so than in other 
species. By contrast, other species’ genome evolution (e.g. in mammals) is 
characterised by wholesale chromosome change facilitated by the presence of a 
greater number of transposable elements (Skinner and Griffin 2012; Farre et al. 
2016; Damas et al. 2017). Nonetheless, until we fully understand the forces that 
shaped avian genomes, there remain many unanswered questions. For example, the 
mechanisms behind the higher propensity of some avian lineages to successfully fix 
interchromosomal evolutionary changes (e.g. in birds of prey, parrots and penguins) 
are still a matter of future investigation. In addition, the origin and role of 
microchromosomes are not fully understood. Answering these intriguing questions 
will require a greater number of complete chromosome-level assemblies. While new 
sequencing and scaffolding technologies that permit this level of assembly are under 
development, additional tools that allow efficient, inexpensive assembly of avian 
genomes are readily available (e.g. the universal BAC clone panel). With these 
methods in hand, it is therefore reasonable to expect that the “mystery” of conserved 
avian karyotypes will be unravelled in the very near future. 


Abstract

How much do we really know about bird genomes? Like other eukaryotic 
genomes, the genomes of birds contain repetitive DNA—tandem repeats, 
transposable elements, and endogenous viruses. Repetitive regions are
notoriously difficult to assemble and often remain inaccessible as gaps within genome
assemblies, a situation which may be metaphorically referred to as genomic “dark 
matter.” Here we review avian repetitive DNA from an integrated avian genomics 
and cytogenetics perspective. While bird genomes are generally relatively
repeat-poor, some genomic regions consist almost entirely of repetitive elements.
Particularly repeat-rich are centromeres, telomeres, and surrounding regions, as well
as the female-specific non-recombining W chromosome. Many of these regions 
are entirely inaccessible with short-read sequencing but may be much better 
resolved with long-read sequencing and other single-molecule technologies. We 
further discuss how repetitive elements may have directly impacted bird
speciation through host-parasite arms races, meiotic drive, and changes in genome
structure. We conclude with a model for improving genome assemblies and 
anticipate that the resolution of genomic “dark matter” will permit a deeper 
understanding of bird genomes. 


1 What Is Genomic Dark Matter?

Genome assembly, a substantial prerequisite for genomics analyses, is nothing
but a complex jigsaw puzzle. Although there is now a multitude of different
sequencing technologies, each with its own advantages and disadvantages
(reviewed by Goodwin et al. 2016), none of these is capable of sequencing entire
chromosomes, at least for bird genomes. As a result, sequencing data are fragments
of the genome which need to be pieced together into genome assemblies. We thus
emphasize that all genome assemblies are only models of the “true” genome, as the
assembly process is comparable to doing a jigsaw puzzle without ever knowing what
the finished picture should look like.

Nearly all avian genome assemblies currently available have been sequenced 
using DNA sequencing technologies that produce short sequence fragments, in the 
order of 50-500 bp (Kraus and Wink 2015), hereafter referred to as short-read 
sequencing data (see Table S1 in Kapusta and Suh 2017). Short-read sequencing
is usually done by paired-end sequencing of DNA fragments of a specific size 
(reviewed by Goodwin et al. 2016), resulting in genome assemblies which consist 
of contiguous sequences (“contigs”) where all nucleotides have been determined and 
linked contigs (“scaffolds”) which contain assembly gaps of undetermined “N” 
nucleotides as placeholders (Yandell and Ence 2012). Sequencing reads thus
correspond to tens of millions of puzzle pieces in a 1 Gb bird genome. Additionally,
eukaryotic genomes usually contain various types of repetitive elements (Fig. 1),
ambiguous pieces which may fit in several places of the puzzle. Individual units of
repetitive DNA can either be dispersed throughout the genome (“interspersed
repeats,” IRs; Fig. 1a) or repeated adjacently to each other (“tandem repeats,”
TRs; Fig. 1b) and are difficult to assemble either if individual repeat units are highly
similar or if repeat units or repeat arrays are longer than individual reads (Chaisson
et al. 2015). Even if read pairs span individual IRs or short arrays of TRs, the central
part of their sequence may be undeterminable and thus constitute a scaffold gap (left
half of Fig. 1). In addition, repetitive regions with several IRs or long arrays of TRs
are usually not spanned by read pairs and thereby lead to scaffold ends (right half of 
Fig. 1) (Chaisson et al. 2015). 
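
The relationship between scaffolds, contigs, and assembly gaps described above can be illustrated with a minimal Python sketch. It assumes, as stated in the text, that gaps are written as runs of “N” placeholders; the function name split_scaffold and the toy sequence are hypothetical and are introduced here for illustration only.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into contigs and gap lengths at runs of 'N'."""
    seq = scaffold.upper()
    # Contigs are the stretches of determined nucleotides between gaps.
    contigs = [piece for piece in re.split(r"N+", seq) if piece]
    # Each run of N is one within-scaffold gap of (approximately) known size.
    gaps = [len(run) for run in re.findall(r"N+", seq)]
    return contigs, gaps


# Hypothetical toy scaffold: three contigs joined by two gaps.
toy_scaffold = "ACGTACGTNNNNNGGCCTTAANNNTTGACA"

contigs, gaps = split_scaffold(toy_scaffold)
n_fraction = sum(gaps) / len(toy_scaffold)
print(len(contigs), "contigs;", len(gaps), "gaps;", f"{n_fraction:.1%} N")
```

A real assembly would be read from a FASTA file one scaffold at a time; the same splitting logic would then apply to each record.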

Altogether, it seems not too surprising that most avian genome assemblies consist 
of thousands or even tens of thousands of sequences (scaffolds), instead of a number 
equal to their known chromosome number (Table 1). This 1000-fold excess in 
numbers of expected versus assembled sequences results from between-scaffold 
gaps of unknown size. Additionally, most avian genome assemblies contain
multiple millions of “N” (i.e., unknown) nucleotides (Table 1) that correspond to
within-scaffold gaps of approximately known size.

Fig. 1 Schematic illustration of how repetitive elements are responsible for “dark matter” in
genome assemblies. (a) Interspersed repeats (IRs; blue) such as transposable elements (TEs) and
endogenous viruses (EVEs) can lead to assembly gaps within scaffolds. Large IRs or IR-rich
regions can lead to gaps between scaffolds. (b) Tandem repeats (TRs; subscript numbers denote
units per repeat array) such as microsatellites (red) and satellites (orange) can either lead to
assembly gaps within scaffolds or to muted gaps, i.e., sequence contraction relative to the “true”
genome. Tandem repeats with large units or arranged as large arrays can lead to gaps between
scaffolds. Figure from Peona et al. (2018), used under CC BY license

To add yet another metaphor, it
may be appropriate to collectively refer to these undetermined (but nevertheless 
existing!) DNA sequences as genomic “dark matter,” “analogous to the unexplained 
dark matter comprising much of the mass in the universe” (Johnson et al. 2005). It is 
worth noting that genomic “dark matter” is not unique to birds; in fact, even the
best-studied human, Drosophila, and mouse genome assemblies are still missing large
parts of their highly repetitive centromeres and telomeres (reviewed by Miga 2015;
Peona et al. 2018). 

One may ask how much of the bird genome has so far been accessible to avian 
genomics and how much DNA constitutes genomic “dark matter.” Considering 
existing fluorometric genome size measurements (Gregory 2017), we estimated an 
average of 18% genomic dark matter for those bird species where both cytogenetic 
and genome assembly data are available (estimates range from 0.9 to 42.4%; 
Table 1, Peona et al. 2018). Most of these genomes are based on short-read 
sequencing, indicating that a significant portion of avian genomes remains 
inaccessible. The only exceptions are zebra finch and chicken where the expected and 
assembled genome sizes are very similar (Table 1). The zebra finch and chicken 
genome assemblies were generated using Sanger (Warren et al. 2010) and long-read 
sequencing (Warren et al. 2017), respectively, technologies with reads over 5-fold 
and 100-fold longer than short-read sequencing. Nevertheless, even these assemblies 
still contain tens of thousands of scaffolds (Table 1, Peona et al. 2018), 
suggesting that having significantly fewer and larger puzzle pieces helps with the 
assembly of many, but not all repetitive regions of the genome. 
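The dark matter estimate discussed above is, at its core, simple arithmetic: the share of the expected genome size that is not represented by determined bases in the assembly. The following Python sketch illustrates this under the assumption that "N" nucleotides are counted as undetermined. The function name, its parameters, and the example numbers are hypothetical and are not taken from Table 1 or from Peona et al. (2018).

```python
def dark_matter_fraction(expected_gb, assembled_gb, n_gb=0.0):
    """Return the fraction of the expected genome that is undetermined.

    expected_gb  -- expected genome size (e.g., from cytogenetic measurements)
    assembled_gb -- total assembly length, including "N" nucleotides
    n_gb         -- amount of "N" (unknown) nucleotides within the assembly
    All values share one unit (here gigabases); names are illustrative only.
    """
    if expected_gb <= 0:
        raise ValueError("expected genome size must be positive")
    determined = assembled_gb - n_gb          # bases that are actually known
    missing = expected_gb - determined        # candidate "dark matter"
    return max(0.0, missing / expected_gb)    # clamp if assembly exceeds estimate


if __name__ == "__main__":
    # Hypothetical species: 1.25 Gb expected, 1.05 Gb assembled, 0.025 Gb of Ns.
    fraction = dark_matter_fraction(1.25, 1.05, 0.025)
    print(f"dark matter: {fraction:.1%}")
```

Clamping at zero is a design choice for cases where the assembly is slightly larger than the measured genome size, which can happen because both values carry measurement error.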

With this in mind, in this chapter we discuss the current knowledge from avian 
genomics about the diversity, distribution, and evolution of repetitive elements. We 
further highlight which parts of avian genomes currently constitute genomic “dark 
matter” through the discrepancy between what is known from cytogenetics and what 
is so far lacking in genomic data. Ultimately, a comprehensive resolution of genomic 
“dark matter” requires an integration of cytogenetics into genomics and will permit a 
deep understanding of bird genomes. 


2 Interspersed Repeats 

Interspersed repeats are repetitive elements occurring in multiple nonadjacent 
locations and with direct or indirect means to disperse throughout genomes. Such 
“genomic parasites” are present in virtually all cellular genomes (Lee and Langley 
2010) and comprise two major groups, transposable elements (TEs) (Kazazian 
Jr. 2004) and endogenous viral elements (EVEs) (Katzourakis and Gifford 2010). 
Class I TEs (retrotransposons; also including retroviruses) mobilize through 
retrotransposition, a copy-and-paste mechanism involving reverse transcription of 
the TE RNA sequence and insertion of the resulting TE complementary-DNA 
(cDNA) into the genome (Fig. 2a) (reviewed by, e.g., Levin and Moran 2011). On 
the other hand, class II TEs (i.e., DNA transposons) usually disperse through 
transposition, a cut-and-paste mechanism where the TE DNA sequence is excised 
and reinserted into the genome (Fig. 2b) (reviewed by, e.g., Levin and Moran 2011). 
A third major mechanism is the occasional insertion of non-retroviral EVEs into the 
genome via nonhomologous end joining (NHEJ) between virus DNA and the 
genome during DNA repair (Fig. 2c) (reviewed by, e.g., Feschotte and Gilbert 2012). 

Birds have a much lower overall diversity of major TE groups (“superfamilies”) 
than other land vertebrates (see Fig. 1 of Kapusta and Suh 2017) and considerably 
fewer EVEs (Cui et al. 2014). Nevertheless, recent years of avian genomics have 
discovered a variety of peculiarities across the avian tree of life, such as TE de novo 
emergences, horizontal transposon transfer, and germline integration of 
non-retroviral EVEs (Fig. 3). In the following, we discuss these findings in detail 
in the context of the different propagation modes of genomic parasites. These 
genomic parasites are either “autonomous” if they encode their own 
cis-mobilizing enzymes or “nonautonomous” if they cannot mobilize on their own but 
rely on trans-mobilization by the enzymatic machinery of autonomous elements. 

Fig. 2 Schematic illustration of the three major mechanisms for the mobility of interspersed 
repeats. (a) Copy-and-paste retrotransposition typical for retrotransposons and retroviruses. A 
reverse transcriptase enzyme (gray) binds the retrotransposon RNA (blue line) and catalyzes reverse 
transcription and complementary-DNA (cDNA) integration into a new target site (white square) via 
staggered cuts, resulting in a target site duplication flanking the TE insertion. (b) Cut-and-paste 
transposition typical for DNA transposons. A dimer of the transposase enzyme (gray) binds the 
transposon DNA (blue box) and catalyzes excision and integration into a new target site (white 
square) via staggered cuts, resulting in a target site duplication flanking the TE insertion. (c) 
Nonhomologous end joining (NHEJ) leading to virus integration. Following viral infection, a 
DNA double-strand break (DSB) is repaired by NHEJ through short sequence homologies between 
the target site (white square) and the viral DNA. In contrast to transposition and retrotransposition, 
an endogenous viral element (EVE) insertion resulting from NHEJ is not flanked by a target site 
duplication 


2.1 Class I Transposable Elements 

2.1.1 LINE Retrotransposons 

Long interspersed elements (LINEs) are retrotransposons which encode a reverse 
transcriptase (RT) protein with an endonuclease (EN) domain and often an ORF1 
protein of unknown function (Fig. 4) (Wicker et al. 2007; Kapitonov and Jurka 
2008). LINEs mobilize via target-primed reverse transcription (TPRT) (Luan et al. 
1993; Ichiyanagi and Okada 2008). The TPRT mechanism is initiated by 
EN-catalyzed nicking of the target site in the genome, followed by reverse 
transcription of the LINE mRNA into cDNA starting from its 3'-end (reviewed by Martin 

Fig. 3 Distribution of interspersed repeats across the avian tree of life. Key events of TE evolution 
(Kapusta and Suh 2017) such as horizontal transfers (HT) were mapped on a recent consensus 
phylogeny of major avian lineages (Suh 2016) which accounts for phylogenetic uncertainty in 

Fig. 4 Long interspersed elements (LINEs) of birds. Schematic organization of the CR1, RTE, and 
R2 superfamilies. Full-length elements exhibit 5' and 3' untranslated regions (5' and 3' UTRs, blue 
and red) and a coding region (gray) encoding an unknown protein (ORF1), an apurinic 
endonuclease (APE) or an endonuclease (EN), and a reverse transcriptase (RT). The 3' UTR ends with a 
characteristic microsatellite motif, and white squares are target site duplications (“v” indicates 
variable lengths). Figure modified from Kapusta and Suh (2017), used with permission from John 
Wiley and Sons 


2006; Levin and Moran 2011). Premature termination of TPRT is common, and 
LINE daughter copies are therefore frequently 5'-truncated, rendering most LINE 
copies “dead on arrival” because they lack the sequence features required for their 
own transcription and retrotransposition (Levin and Moran 2011). 

Nearly all avian LINEs (~65,000-630,000 copies per genome; Zhang et al. 2014) 
belong to the chicken repeat 1 (CR1) superfamily and together make up 39-88% of 
all TE copies in available bird genome assemblies (Hillier et al. 2004; Warren et al. 
2010; Zhang et al. 2014). This CR1 dominance is common for the genomes of most 
land vertebrates (Shedlock et al. 2007; Suh et al. 2015a; Janes et al. 2010). Avian 
CR1 copies have been grouped into at least 14 families (Hillier et al. 2004; Warren 
et al. 2010), many of which were active over long parts of avian evolution (Matzke 
et al. 2012; Kriegs et al. 2007; Suh et al. 2011b) and have been used as phylogenetic 
markers for the avian tree of life (e.g., Suh et al. 2011b; Haddrath and Baker 2012; 
Kaiser et al. 2007). Although some CR1 copies appear to be ancient (Vandergon and 
Reitman 1994), the CR1 families present in bird genomes emerged from a single 
ancestral CR1 lineage after the bird/crocodilian divergence (Suh et al. 2015a; Suh 
2015). The vast majority of CR1 copies are 5'-truncated (Vandergon and Reitman 
1994; Wicker et al. 2005; Hillier et al. 2004), but even the shortest copies usually 

Fig. 3 (continued) genome-scale phylogenies (Jarvis et al. 2014; Prum et al. 2015; Suh et al. 
2015b). The distribution table (right) contains TEs which are ubiquitous across bird genomes (CR1 
LINEs and LTR retrotransposons or endogenous retroviruses, ERVs; Kapusta and Suh 2017) and 
known non-retroviral EVEs (hepadnaviruses, circoviruses, parvoviruses, and bornaviruses; Cui 
et al. 2014). Interspersed repeats are either present in all (“+”), some (“±”), or none (“-”) of the 
genomes sampled from the respective lineage. Left part of figure modified from Kapusta and Suh 
(2017), used with permission from John Wiley and Sons 

contain the typical hairpin structure and 8 bp microsatellite motifs of the 3' UTR 
(Fig. 4) (Suh 2015). Whether these CR1 structures affect nearby genes remains 
unknown (Kapusta and Suh 2017). 

In addition to the ubiquitous CR1 elements, most bird genome assemblies contain 
very low copy numbers of R2 elements (Kojima et al. 2016), a peculiar LINE 
superfamily widely distributed across animals and which exhibits site-specific 
retrotransposition into 28S rRNA genes (Kojima and Fujiwara 2005). Finally, 
some unrelated bird lineages (songbirds, some parrots, hornbills, trogons, 
hummingbirds, mesites, and tinamous; Fig. 3) contain thousands of copies of the 
AviRTE element from the RTE superfamily as a result of repeated horizontal 
transfer between these birds and parasitic nematodes (Suh et al. 2016). The RTE 
superfamily appears to be notorious for horizontal transfer in general, given that 
another RTE element (BovB) “jumped” repeatedly between the genomes of some 
mammals, reptiles, and ticks (Kordis and Gubensek 1998; Walsh et al. 2013). 

2.1.2 SINE Retrotransposons 

Short interspersed elements (SINEs) are parasites of LINEs. As the name implies, 
they are too short for having protein-coding capacity and thus rely on the LINE 
enzymatic machinery. SINEs are part of the small RNA transcriptome as their heads 
(5' ends) are derived from small RNA genes (usually from tRNAs) and their internal 
promoters allow high transcription levels (Wicker et al. 2007; Kapitonov and Jurka 
2008). SINE tails (3' ends) are usually highly similar to the 3' UTRs of LINEs they 
parasitize, which allows frequent retrotransposition of SINE RNAs by “hijacking” 
the LINE RT protein (Ohshima et al. 1996). As a result, SINEs by far outnumber 
LINEs in the human genome (~1,600,000 SINEs vs. ~900,000 LINEs; Lander et al. 
2001). 

In contrast, bird genomes have relatively low SINE copy numbers of 
~6000-17,000 (vs. ~65,000-630,000 LINEs; Zhang et al. 2014). Most of these 
SINEs belong to the ancient families MIR, AmnSINE, and LFSINE which were 
mobilized long before the last common ancestor of birds (reviewed by Kapusta and 
Suh 2017) and are present across land vertebrates (Hirakawa et al. 2009; Bejerano 
et al. 2006; Nishihara et al. 2006; Green et al. 2014). At least some of these ancient 
SINEs might be involved in 3D genome folding as previously shown for mammals 
(Schmidt et al. 2012; Wang et al. 2015) and suggested by the high sequence 
conservation of hundreds of orthologous SINE copies across birds (Craig et al. 
2018). 

There are, however, SINEs with recent activity in some avian lineages due to 
recurrent de novo emergence (Fig. 3). CR1-mobilized SINEs emerged twice within 
passerines and independently in pelicans and trogons (Suh et al. 2017; Kapusta and 
Suh 2017; Warren et al. 2010) and contain tRNA-derived SINE heads and SINE 
tails with the typical structural features of CR1 3' UTRs (Fig. 5) (Suh 2015). 
Furthermore, RTE-mobilized SINEs emerged three times in suboscine passerines 
and independently in some parrots and hornbills (Suh et al. 2016). These 
RTE-SINEs have a peculiar diversity of SINE heads (e.g., derived from tRNA, 
5S-rRNA, or 28S-rRNA genes; Suh et al. 2016) and tails derived from the 5' and 3' 

Fig. 5 Short interspersed elements (SINEs) of birds. Schematic organization of the CR1-SINE and 
RTE-SINE superfamilies. Full-length elements exhibit a SINE head derived from a small RNA 
(sRNA; yellow; usually a tRNA) and a SINE tail derived from the LINE element responsible for 
SINE mobilization. In CR1-SINEs, the SINE tail is homologous to the CR1 3' UTR (red), and in 
RTE-SINEs, it is homologous to the RTE 5' and 3' UTRs (blue and red). Corresponding to its 
mobilizing LINE, the SINE tail ends with a characteristic microsatellite motif, and white squares are 
target site duplications (“v” indicates variable lengths). Figure modified from Kapusta and Suh 
(2017), used with permission from John Wiley and Sons 

UTRs of AviRTE LINEs (Fig. 5). This bipartite organization of SINE tails seems to 
be typical for RTE-mobilized SINEs (Gogolevsky et al. 2008; Kojima 2018). 

2.1.3 LTR Retrotransposons 

In contrast to LINEs and SINEs, often collectively termed “non-LTR 
retrotransposons,” LTR retrotransposons mobilize via replicative retrotransposition, 
a mechanism mediated by “long terminal repeats” (reviewed by, e.g., Kazazian 
Jr. 2004; Levin and Moran 2011). LTRs are thus “repeats within a repeat” 
(Fig. 6). Importantly, retroviruses belong to the group of LTR retrotransposons 
(Wicker et al. 2007). Full-length LTR retrotransposons encode several protein 
domains, including a reverse transcriptase and a capsid protein (Fig. 6). During 
replicative retrotransposition, the LTR retrotransposon mRNA is first reverse- 
transcribed within a virus-like particle in the cytoplasm, and the cDNA is transported 
back into the nucleus for insertion into the genome (Levin and Moran 2011). In 
contrast to TPRT which leads to frequent 5'-truncation of daughter LINEs and 
SINEs, replicative retrotransposition yields full-length daughter LTR 
retrotransposons (Levin and Moran 2011). Recently inserted LTR retrotransposons 
therefore contain two LTRs which are identical to each other and often cause 
intra-element ectopic recombination (Devos et al. 2002). This process deletes one of the 
two LTRs and the internal protein-coding regions, thus resulting in a solo-LTR 
within the original target site duplication (Fig. 6b). Solo-LTRs are unable to 
retrotranspose; however, they contain promoters and may thus influence the 
transcription of nearby genes (Kovalskaya et al. 2006; Chuong et al. 2017). 
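The outcome of intra-element recombination described above can be illustrated on toy strings: a full-length insertion (TSD, LTR, internal region, LTR, TSD) collapses into a solo-LTR that is still flanked by the original target site duplication. The Python sketch below is a deliberately simplified model with hypothetical function names and made-up sequences. It assumes perfectly identical LTRs and ignores the actual recombination machinery.

```python
def build_ltr_insertion(tsd, ltr, internal):
    """Assemble a toy full-length LTR retrotransposon flanked by its TSD."""
    return tsd + ltr + internal + ltr + tsd


def recombine_ltrs(locus, ltr):
    """Collapse the first and last LTR copy of a locus into a single LTR.

    Models recombination between the two identical LTRs: the internal
    region and one LTR are deleted, everything outside is left untouched.
    """
    first = locus.find(ltr)
    last = locus.rfind(ltr)
    # Require two separate, non-overlapping LTR copies.
    if first == -1 or last < first + len(ltr):
        return locus
    return locus[:first] + locus[last:]


if __name__ == "__main__":
    tsd, ltr, internal = "GATC", "TGTAGCCA", "ATGGGCCCTTTAAA"  # made-up sequences
    full_length = build_ltr_insertion(tsd, ltr, internal)
    solo = recombine_ltrs(full_length, ltr)
    print(full_length)
    print(solo)  # TSD + single LTR + TSD
```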

LTR retrotransposons are present in genome assemblies from all major avian 
lineages (Fig. 3). Birds generally contain only retrovirus-like LTR retrotransposons 
(endogenous retroviruses, ERVs), more precisely from the superfamilies ERV1, 
ERVK, and ERVL (reviewed by Kapusta and Suh 2017; Suh et al. 2018). These 
superfamilies can be distinguished by the specific lengths of their target site 
duplications (Fig. 6) (Kapitonov and Jurka 2008), allowing even the classification 
of solo-LTRs which by far outnumber full-length elements in birds (Wicker et al. 
2005). On the other hand, the virological classification of ERVs is based on their 

Fig. 6 Long terminal repeat (LTR) retrotransposons of birds. (a) Schematic organization of the 
endogenous retrovirus (ERV) superfamilies ERV1, ERV2 (ERVK), and ERV3 (ERVL). Full- 
length elements exhibit a pair of LTRs (green) and a coding region (gray) encoding a capsid protein 
(GAG), an aspartic proteinase (AP), a reverse transcriptase (RT), an RNase H (RH), an integrase 
(INT), and an envelope protein (ENV). White squares are target site duplications (numbers indicate 
specific lengths). Figure modified from Kapusta and Suh (2017), used with permission from John 
Wiley and Sons. (b) Schematic illustration of nonallelic homologous recombination (NAHR) 
between the LTRs of a full-length LTR retrotransposon. This process deletes the internal coding 
region and one of the LTRs, leaving behind a solitary LTR (solo-LTR) 


protein-coding genes and suggests that bird genomes contain mostly 
alpharetroviruses and gammaretroviruses (Hayward et al. 2015) (but see Cui et al. 
2014, who report mostly betaretroviruses and gammaretroviruses). Total copy 
numbers of LTR retrotransposons are ~24,000-120,000 in 48 sampled genome 
assemblies (Zhang et al. 2014), which suggests that some avian lineages witnessed 
more LTR retrotransposon activity than others. Accordingly, analyses of 
orthologous solo-LTRs for their presence/absence in bird genomes indicate 
abundant and diverse activity of LTR retrotransposons, especially during the neoavian 
radiation (Fig. 3) (Suh et al. 2011b, 2015b). Nevertheless, the highest LTR 
retrotransposon activity so far reported from birds occurred relatively recently in 
the genomes of songbirds (reviewed by Kapusta and Suh 2017). A comparison of the 
three available in-depth, manually curated repeat annotations for birds (chicken, 
zebra finch, and collared flycatcher) (Warren et al. 2010; Hillier et al. 2004; Suh et al. 
2018) suggests that LTR retrotransposons are not only more abundant in songbirds 
than in chicken but also more diverse in terms of the presence of distinct LTR 
retrotransposon families and subfamilies (Suh et al. 2018). Most of these LTR 
subfamilies are lineage-specific, i.e., existing in either zebra finch (175 subfamilies 

Fig. 7 DNA transposons of birds. Schematic organization of the mariner and hAT superfamilies. 
Full-length elements exhibit a pair of terminal inverted repeats (TIRs; purple) and encode a DDE 
transposase. Nonautonomous (NA) elements lack transposase-coding capacity. White squares are 
target site duplications (letters indicate specific motifs and numbers indicate specific lengths). 
Figure modified from Kapusta and Suh (2017), used with permission from John Wiley and Sons 


from 17 LTR families) or collared flycatcher (33 subfamilies from 14 LTR families), 
which implies frequent germline invasions of songbirds by a variety of retroviruses. 
Strikingly, this songbird LTR diversification coincides with a recent replacement of 
LINE activity by LTR activity, which occurred independently in the collared 
flycatcher and zebra finch lineages (Suh et al. 2018). In-depth repeat annotations 
of further songbird genomes are necessary to establish the timeframe and extent of 
co-speciation between retroviruses and the species-rich songbirds. 


2.2 Class II Transposable Elements 
2.2.1 DNA Transposons 

DNA transposons usually move from one genomic locus to another through a cut- 
and-paste mechanism catalyzed by their own transposase proteins (Fig. 7). A dimer 
of this transposase recognizes the terminal inverted repeats (TIRs), excises the DNA 
transposon at its termini, and reinserts the transposon DNA elsewhere (Fig. 2b; 
reviewed by, e.g., Kidwell 2005; Feschotte and Pritham 2007; Levin and Moran 
2011). Some DNA transposons lack coding capacity for a transposase and are thus 
nonautonomous (Feschotte and Pritham 2007). The only structural hallmarks present 
in all DNA transposons are TIRs and target site duplications (TSDs) with specific 
lengths and motifs (Fig. 7) (Wicker et al. 2007). 
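The two structural hallmarks named above can be checked on a candidate locus with a few string operations: terminal inverted repeats mean that the end of the element is the reverse complement of its start, and a target site duplication means that the bases directly flanking the element are identical. The Python sketch below assumes perfect, undegraded TIRs and TSDs, which is rarely true for old insertions. All function names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_tir(element, tir_len):
    """True if the last tir_len bases mirror the first tir_len bases."""
    element = element.upper()
    if tir_len <= 0 or 2 * tir_len > len(element):
        return False
    return element[-tir_len:] == reverse_complement(element[:tir_len])


def has_tsd(left_flank, right_flank, tsd_len):
    """True if the tsd_len bases on both sides of the element are identical."""
    if tsd_len <= 0 or min(len(left_flank), len(right_flank)) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


if __name__ == "__main__":
    tir = "CAGTGGTT"                                        # made-up 8 bp TIR
    element = tir + "ACGATCGA" + reverse_complement(tir)    # toy transposon
    print(has_tir(element, 8))                              # TIRs present
    print(has_tsd("GGCTA", "TACCG", 2))                     # 2 bp TSD present
```

Real annotation tools additionally tolerate mismatches, because TIRs and TSDs decay by mutation after insertion; exact matching is used here only to keep the sketch short.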

Bird genomes generally contain very low amounts of DNA transposons. Most of 
these are Charlie7 elements from the hAT superfamily and accumulated during early 
bird evolution, before the divergence of chicken and zebra finch (Warren et al. 2010; 
Kapusta and Suh 2017). Then again, the chicken genome contains over 16,000 
copies of relatively recently inserted DNA transposons, namely, Galluhop from the 
mariner superfamily and Charlie12 from the hAT superfamily (Wicker et al. 2005). 
These appear to have been active before the divergence of chicken and turkey 
(Kapusta and Suh 2017). Very few chicken DNA transposon copies are 
transposase-encoding elements; most are short nonautonomous elements (Fig. 7) 
(Wicker et al. 2005). Recent evidence further suggests the presence of Galluhop 
transposons in other Galliformes and a horizontal transfer event to hornbills (Fig. 3) 
(Bertocchi et al. 2017). 

2.2.2 Other Transposons 

There are two subclasses among class II DNA transposons which employ different 
mobilization mechanisms than the aforementioned cut-and-paste DNA transposons 
(Wicker et al. 2007). These subclasses are Helitrons which copy and paste by rolling- 
circle replication (Kapitonov and Jurka 2007) and Mavericks (also known as 
Polintons) which copy and paste by self-replication (Kapitonov and Jurka 2006; 
Feschotte and Pritham 2005). Both mechanisms differ from class I retrotransposons 
in that they do not involve an RNA intermediate. Although discovered much more 
recently than cut-and-paste DNA transposons, Helitrons and Mavericks are widely 
distributed across eukaryotes including various vertebrates (Feschotte and Pritham 
2007; Thomas et al. 2010; Pritham et al. 2007). To our knowledge, none of these 
peculiar DNA transposons have yet been reported from bird genomes. 


2.3 Endogenous Viruses 

Endogenous viral elements (EVEs) are genomic “fossils” of viruses that infected the 
germline and became integrated in the host genome. Thus far, EVEs from a wide 
range of virus groups have been unearthed (reviewed by, e.g., Feschotte and Gilbert 
2012). The most common and widespread EVEs in birds are retroviruses (Cui et al. 
2014) which are reverse-transcribing single-strand RNA (ssRNA) viruses 
(Retroviridae) with obligate integration into the host genome through 
retrotransposition (Feschotte and Gilbert 2012). Retroviruses can also be classified as 
LTR retrotransposons (see Sect. 2.1.3). On the other hand, although some 
non-retroviral EVEs may occasionally integrate into host genomes with the help of 
TE-encoded enzymes, many insertions lack typical hallmarks of (retro)transposition 
such as target site duplications (Fig. 2a) (Feschotte and Gilbert 2012). A plausible 
explanation is that most of the non-retroviral EVEs integrated through 
nonhomologous end joining (Fig. 2c), an error-prone pathway of the host genome to repair 
DNA double-strand breaks (Feschotte and Gilbert 2012). The diversity of 
non-retroviral EVEs identified in birds comprises ssDNA viruses (Circoviridae, 
Parvoviridae), ssRNA viruses (Bornaviridae), and reverse-transcribing dsDNA 
viruses (Hepadnaviridae) (Cui et al. 2014), all of which are present in low copy 
numbers. The reasons for the avian scarcity of non-retroviral EVEs remain debated 
(Kapusta and Suh 2017). 

The EVEs of circoviruses, parvoviruses, and bornaviruses have very patchy 
distributions across the avian tree of life (Fig. 3) as they each were detected in 
only 3-10 of the sampled 48 bird genomes (Cui et al. 2014). In contrast, EVEs of 
hepadnaviruses have been detected in all analyzed bird genomes with the exception of 
chicken and turkey (Suh et al. 2013; Cui and Holmes 2012; Cui et al. 2014; Liu et al. 
2012a; Gilbert and Feschotte 2010; Katzourakis and Gifford 2010). While it remains 
unclear for most EVEs whether they are present in closely related species due to 
either orthologous or independent germline infiltration, presence/absence analyses 
of some hepadnaviral EVEs have provided direct evidence that birds have been 
infected by hepadnaviruses for >70 My since at least the ancestor of Neoaves (Suh 
et al. 2013). This duration of hepadnavirus-bird association is the longest so far 
known from their amniote hosts (Suh et al. 2014). 


3 Tandem Repeats 

Tandem repeats are repetitive elements which are adjacent to each other and thereby 
form repeat arrays. These can emerge and proliferate through a variety of 
mechanisms (Fig. 8). Below we discuss different types of tandem repeats in 
ascending order of repetitive element size, ranging from microsatellites, minisatellites, and 
satellites to gene families and copy number variations. 


3.1 Microsatellites and Minisatellites 

Microsatellites are short DNA motifs (<10 bp) which occur in arrays that typically 
do not extend further than a few hundred base pairs (Ellegren 2004). They are 
ubiquitous in eukaryote genomes and scattered along chromosomes in numerous 
places. Predominantly nonfunctional and thus under little selective constraint 
(exceptions include regulation of chromatin organization and gene expression, Li 
et al. 2002), microsatellites are among the fastest evolving sequences with mutations 
arising through errors in the DNA replication process. So-called replication 
slippage (Fig. 8a), which becomes more frequent with increasing length of the microsatellite array, 
causes alleles mainly to differ by copy number rather than single-nucleotide changes 
(Ellegren 2004; Schlotterer and Tautz 1992). The resulting high number of alleles 
accounts for the extremely high variation in microsatellite loci, compared to the 
genome-wide average, thus facilitating a wealth of genetic analyses. Among the 
most important is the genotyping in paternity analysis, which allows the correct 
assignment of biological parents to a given offspring. Building on that, 
microsatellites have been used early on for genetic mapping in pedigree studies, to 
assess genetic diversity and structure in population genetic studies (e.g., Crooijmans 
et al. 1996; Hansson et al. 2000), and more recently for population monitoring based 
on environmental samples and wildlife forensics (De Barba et al. 2016; Jan and 
Fumagalli 2016). 
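The definition used above (motifs shorter than 10 bp, repeated in tandem) translates directly into a pattern search. The Python sketch below scans a sequence for perfect tandem arrays with a regular expression. It is an illustration only: the function name and the thresholds are assumptions, the example sequence is made up, and dedicated tools that tolerate interrupted or imperfect repeats should be preferred for real data.

```python
import re


def find_microsatellites(seq, max_unit=9, min_repeats=4):
    """Find perfect tandem repeats with unit sizes of 1..max_unit bp.

    Returns a list of (start, unit, copies) tuples with 0-based start
    positions. The lazy quantifier makes the regex try the shortest
    possible unit first, so ACACAC is reported as (AC)3, not as a longer unit.
    """
    if max_unit < 1 or min_repeats < 2:
        raise ValueError("max_unit must be >= 1 and min_repeats >= 2")
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # Made-up sequence with an (AC)6 and an (AAT)4 array.
    toy = "TTG" + "AC" * 6 + "GGC" + "AAT" * 4 + "C"
    for start, unit, copies in find_microsatellites(toy):
        print(start, unit, copies)
```

Because alleles at such loci differ mainly in copy number, the `copies` value is the quantity that genotyping assays effectively measure.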
Fig. 8 Mechanisms of tandem repeat array proliferation. (a) Replication slippage is based on the 
inaccuracy of the DNA polymerase. When replicating a tandem repeat array, it sometimes jumps 
back a few base pairs and repeatedly replicates some nucleotides, leading to a repeat expansion. If 
the array forms a hairpin loop, the polymerase might replicate only the bases connecting the hairpin, 
resulting in a repeat contraction. (b) Gene conversion occurs if a double-strand break during meiosis 
is not resolved with a crossing-over but with the total replacement of one allele in the homologous 
sequence, resulting in a homogenization of alleles. This can also happen via ectopic recombination 
including paralogous sequences. (c) Rolling-circle replication has thus far only rarely been shown in 
relation to satellite DNA dynamics. After hairpin loop formation, the sequence (often one repeat 
unit) circularizes and acts as a template for DNA replication. (d) In unequal crossing-over, high 
sequence identity (typically the case in tandem repeat arrays) leads to imprecise pairing of 
homologous chromosomes. After resolution of the crossing-over, half of the meiotic products are 
imbalanced, with either deleted or duplicated sequence 


Although occurring in a lower density than in, for example, humans (Primmer 
et al. 1997; Guizard et al. 2016), a plethora of avian microsatellite loci and the 
corresponding PCR primers have been described for many different species and 
taxonomic groups. Given that the applicability of a certain locus for cross-species 
amplification is usually constrained by the divergence between two species 
(Primmer et al. 2005), there are only a few “universal” loci which can be used for a wide 
range of species. With the emergence of readily available high-throughput 
sequencing and its wealth of sequence data, microsatellites were thought to become obsolete 
(Schlotterer 2004). However, in cases where cost-efficiency, number of individuals, 
or DNA quality plays a role (e.g., wildlife monitoring, behavioral studies), 
microsatellites are still the method of choice (but see Kraus et al. 2015). 
Additionally, the discovery of suitable loci has become much easier since many genome 
assemblies have become available. By aligning genome assemblies of target species 
to each other, conserved loci can be found (Kupper et al. 2008; Dawson et al. 2010, 
2013), thus providing a valuable complement to traditional microsatellite studies. 
Additionally, high-throughput sequencing technologies enable the genome-wide 
investigation of microsatellite evolution, an area of research which is, due to 
technological and analytical constraints (Fig. 1), only in its early stage of 
development (Gymrek 2017). 

Minisatellites are tandem repeat arrays with unit sizes longer than 10 bp. They 
evolve via similar mechanisms as microsatellites, but they occur far less frequently 
and are presumably more relevant to gene function (Lopez-Flores and Garrido- 
Ramos 2012). Although there are a few examples of minisatellite loci characterized 
in birds (e.g., Hanotte et al. 1991, 1992), the much higher abundance of 
microsatellite loci has led to a quick replacement of minisatellites for the abovementioned 
applications (Lopez-Flores and Garrido-Ramos 2012). 


3.2 Satellites 

Satellites are generally defined as repetitive DNA which occurs in tandem repeat 
arrays with repeat units between 200 and several thousand base pairs, although 
the size limits vary and also depend on the organism studied (Lopez- 
Flores and Garrido-Ramos 2012; George and Alani 2012). The term “satellite DNA” 
originated in the description of a density gradient in DNA centrifugation analysis 
and refers to one or several separate bands seen, like satellites orbiting a planet. The 
reason for this lies in the fact that satellite repeat sequences often deviate from an 
equilibrium base composition (i.e., A, C, T, and G in roughly equal numbers), such 
that the resulting density also changes compared to the genome-wide average. If the 
satellite repeat then occurs in a high copy number, a discrete band will be seen in the 
density gradient (Plohl et al. 2012). Another common technique to detect satellite 
DNA is to digest genomic DNA with a restriction enzyme and fractionate bands via 
gel electrophoresis (Manuelidis 1977). To determine the smallest 
repeat unit, multiple enzymes are used sequentially with a size standard on the gel, 
combined with cloning and sequencing of the band (Madsen et al. 1992). Once the 
sequence of a repeat is known, it can be used for fluorescence in situ hybridization to 
determine its location along the genome and associate it with certain chromosomal 
features, such as the centromere (e.g., Longmire et al. 1988). 

Satellites are an abundant source of repetitive DNA and seem to play an important 
role in maintaining genome integrity and organization as a main element of 
constitutive heterochromatin (Grewal and Jia 2007; Peng and Karpen 2008; George and 
Alani 2012). These regions of the chromosome, which are usually associated with 
repetitive sequences, are densely packed through histone modifications throughout 
the stages of mitosis and meiosis. Although mainly distributed in or near 
centromeres and near telomeres of most chromosomes, satellites also occur in 
heterochromatic fractions of sex chromosomes (Lopez-Flores and Garrido-Ramos 
2012) and, in shorter arrays, scattered along various chromosomes (Saksouk et al. 
2015). 

Satellite arrays can expand and contract via a range of different mechanisms 
(Fig. 8b-d). In gene conversion, a double-strand break during meiosis is followed by 
the invasion of one strand and ultimately resolved by the complete replacement of 
one allele, thus homogenizing the tandem repeat array (Chen et al. 2007, Fig. 8b). 
Rolling-circle replication involves a circularized stretch of DNA (e.g., one unit of a 
satellite) which replicates and inserts itself into the tandem array (Rossi et al. 1990, 
Fig. 8c). Finally, unequal crossing-over is another mechanism facilitating the 
expansion and contraction of tandem repeat arrays. Hereby an imprecise pairing of 
homologous sequences during meiosis leads to unbalanced chromosomes after 
meiosis (Graur and Li 2000, Fig. 8d). Additionally, Bersani et al. (2015) have 
shown that satellites can proliferate via RNA-DNA hybrid molecules, although 
this seems to be restricted to mitotic cancer cells. 
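
The bookkeeping behind unequal crossing-over can be illustrated with a toy simulation. The sketch below is a deliberate simplification and not a model taken from the cited literature: the function name `unequal_crossover`, the uniform choice of the misaligned repeat units, and the starting copy numbers are all assumptions made for illustration. It only shows that one recombinant array gains exactly the units that the other one loses.

```python
import random


def unequal_crossover(copies_a, copies_b, rng):
    """Toy model of one unequal crossing-over event between two tandem arrays.

    copies_a, copies_b: number of repeat units in the two arrays (each >= 1).
    Returns the copy numbers of the two recombinant arrays.
    """
    # Misaligned pairing: unit i of array A pairs with unit j of array B.
    # Drawing both positions uniformly is an illustrative assumption.
    i = rng.randint(1, copies_a)
    j = rng.randint(1, copies_b)
    # Recombinant 1: units 1..i of A joined to the units after j of B.
    recombinant_1 = i + (copies_b - j)
    # Recombinant 2: units 1..j of B joined to the units after i of A.
    recombinant_2 = j + (copies_a - i)
    return recombinant_1, recombinant_2


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that the run is repeatable
    a, b = 100, 100          # hypothetical starting array lengths
    for generation in range(5):
        a, b = unequal_crossover(a, b, rng)
        print(generation, a, b, a + b)
```

In this sketch the summed copy number of the two arrays stays constant, while the individual arrays expand or contract from one event to the next.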

Despite avian genomes being comparatively repeat-poor (reviewed by Kapusta 
and Suh 2017), cytogenetic data suggest that satellites make up a considerable 
proportion of the total amount of DNA of avian genomes (e.g., a 190 bp satellite 
accounts for up to 6.8% of the entire genome of some parrot species, Madsen et al. 
1992). Avian satellites have been extensively studied using cytogenetic and molecular 
genetic methods, and a variety of different satellite families have been described 
(e.g., Kodama et al. 1987; Matzke et al. 1990, 1992; Wang et al. 2002). Strikingly 
though, there are very few examples of comprehensive genome-wide avian satellite 
studies. This might partly stem from the fact that satellites, or rather entire arrays of 
satellites, are notoriously difficult to assemble in a linear fashion due to reasons 
mentioned above (see Sect. 1). One exception is the comprehensive characterization 
of repetitive DNA in the previous chicken genome assembly (Guizard et al. 2016). In 
this study, 0.24% of the genome assembly could be attributed to satellite DNA arrays 
with more than 50 copies and unit sizes ranging from 60 bp to 2 kb. Furthermore, a 
14 kb satellite with a 1.4 kb subunit has been identified in crows via optical mapping, 
long-read sequencing, and manual sequence assembly curation and shown to potentially 
be a component of centromeric and subtelomeric heterochromatin 
(Weissensteiner et al. 2017). This satellite forms large-scale tandem repeat arrays 
(considerably longer than 100 kb) which were anchored into the crow genome 
assembly by the combination of different approaches and were shown to influence 
population genetic parameters. In regions adjacent to these long tandem repeat 
arrays, genetic diversity is drastically reduced, which also leads to increased genetic 
differentiation between hooded and carrion crow populations. This study illustrates 
the necessity of considering the location of structural genomic features (such as the 
presence of large tandem repeat arrays) for downstream genomic analyses, in order 
to complement conclusions drawn from single-nucleotide polymorphism data. 



3.3 Gene Families and Copy Number Variation 

Gene families represent genetic elements which originated by a duplication event of 
a gene and are arranged either in a tandem array or spread out as interspersed copies 
(Lopez-Flores and Garrido-Ramos 2012). Depending on the age of the duplication 
event, these paralogs can be quite diverged and can either lose their function or adopt 
new ones. The number of paralogs can range from fewer than 10 to several 
hundred (Lopez-Flores and Garrido-Ramos 2012). Among the most notable avian 
examples is the beta (β) keratin multigene family. In chicken, 111 complete β-keratin 
paralogs have been identified, spread out on six different chromosomes (Greenwold 
and Sawyer 2010). Diversification among and within several subfamilies of this 
multigene family via recombination, deletion, and duplication events has led to the 
subsequent evolution of claws and beaks from the scale subfamily and, further, to the 
feather and feather-like subfamily. Furthermore, the first large-scale sequencing 
effort targeting 48 bird species spanning all major avian clades revealed a 
higher number of opsin genes than in mammalian genomes and likewise found a 
relationship between avian tetrachromatic vision and the presence of two to three 
conopsin genes (Zhang et al. 2014). Another example for how new technologies are 
aiding the discovery and characterization of gene families in non-model organisms is 
provided in Larsen et al. (2014). They used long-read sequencing to investigate the 
vomeronasal receptor genes (a highly diverse gene family in mammals) and 
concluded that previous assemblies have underestimated the diversity of this complex 
region and that long-read sequencing will greatly improve characterization of 
gene families in other non-model organisms. 

Copy number variation (CNV) describes differences in the number of copies of a certain 
chromosomal segment (e.g., a gene) between any two individuals. CNVs have been 
extensively studied in birds, particularly in the field of poultry genetics. The majority 
of studies have used hybridization arrays to detect CNVs, but SNP chips and 
whole-genome sequencing have also been applied (Wang and Byers 2014). It seems that 
CNVs are an abundant source of genetic variation in chicken, with a total of 3961 
reported CNVs contained in 1876 CNV regions (Wang and Byers 2014). This 
accounts for 8.3% and 9.6% of the chicken genome and the ordered assembly, 
respectively. In a study involving 16 species of birds, Skinner et al. (2014) found 
790 cross-species CNVs in 135 unique regions, which refutes the previous assumption 
that CNVs affect smaller regions and occur less frequently in birds than in 
mammals. They supported the notion that microchromosomes exhibit more CNVs 
than macrochromosomes, a pattern that interestingly is still observed in derived 
macrochromosomes consisting of fused ancestral microchromosomes. This suggests 
that not size but rather sequence-related features of microchromosomes are responsible 
for increased meiotic recombination, leading to a higher frequency of CNVs. 
Furthermore, they showed that 70% of detected CNVs contain genes, implying 
functional importance (Skinner et al. 2014). 

CNVs have been shown to have direct consequences for the phenotype of the 
carrier; striking examples in chicken are the silkie phenotype (hyperpigmentation of 
skin and connective tissue), which is caused by an inverted duplication and junction 
of genomic regions more than 400 kb apart (Dorshorst et al. 2011), and the pea comb 
phenotype (reduced comb and wattles), which can be traced back to a massive 
expansion of a duplicated intronic sequence of the SOX5 transcription factor (Wright 
et al. 2009; Vignal and Eory 2019). Clearly, the evolution of copy number variation 
is highly dynamic and can have a massive effect on the phenotype, likely much more so than 
genetic variation in single nucleotides (Wang and Byers 2014). 


4 Chromosomal Distribution of Genomic "Dark Matter" 

Nearly all bird species possess karyotypes with several large macrochromosomes 
and numerous small microchromosomes (reviewed by Kapusta and Suh 2017; 
Organ et al. 2008; Burt 2002; Damas et al. 2019; Griffin et al. 2019). Irrespective 
of their size, eukaryotic chromosomes are generally characterized by the presence of 
a centromere and two telomeres, typically highly repetitive regions with important 
functions (reviewed by Mekhail and Moazed 2010) and often tied to local suppression 
of recombination (reviewed by George and Alani 2012). This, together with the 
assumption that one meiotic crossover is obligate per chromosome, explains why the 
meiotic recombination landscape is highly heterogeneous in birds, with 
low-recombination regions near putative centromeres and generally higher recombination 
rates on smaller chromosomes (Kawakami et al. 2014; Backström et al. 2010; 
Stapley et al. 2010). As increased levels of meiotic recombination increase the 
efficacy of selection in purging slightly deleterious mutations (Ellegren and Galtier 
2016), it is not too surprising that microchromosomes are generally less repetitive 
than macrochromosomes (Fig. 9). Microchromosomes also have high gene densities 
(Ellegren 2010; Organ and Edwards 2011; Burt 2002), rendering mutations such as 
expansions of repetitive elements effectively deleterious in many sequence contexts. 
On the other hand, the Z and W sex chromosomes have high and very high repeat 
densities, respectively (Fig. 9), which correspond to their levels of meiotic recombination. 
Except for the pseudoautosomal region (PAR), the Z chromosome 
recombines only in males, and the W chromosome is non-recombining and 
female-specific (Smeds et al. 2014). Additionally, many microchromosomes and 
macrochromosomes contain highly repetitive regions with repeat densities of up to 
~50% near centromere-associated assembly gaps (Fig. 9). 

Most of these repeat-rich regions only became visible in the newest chicken 
galGal5 assembly, the only available chromosome-level bird genome assembly 
based on long-read sequencing (Warren et al. 2017). This extends the observation 
from Sanger and short-read sequencing that avian genomes are repeat-poor and 
evolutionarily stable, i.e., show very few chromosomal rearrangements (Ellegren 
2010). Kapusta and Suh (2017) recently noted that such a compartmentalization of 
avian genomes into repeat-poor/gene-rich/stable regions and repeat-rich/gene-poor/unstable 
regions (Fig. 9) is reminiscent of the “two-speed genome” model in 
plant-pathogenic fungi (Raffaele and Kamoun 2012). In the following, we discuss those 
genomic regions which are particularly enriched in repetitive elements and/or 
constitute genomic “dark matter.” These are either parts of chromosomes such as 
centromeres and telomeres or entire chromosomes such as the female-specific W 
chromosome, the supernumerary B chromosome, and the immune gene-rich 
chromosome 16. 

Fig. 9 The heterogeneous landscape of repetitive elements and assembly gaps in the 
third-generation chicken genome. Selected chromosomes are shown from a figure from Kapusta and 
Suh (2017), used with permission from John Wiley and Sons 


4.1 Distribution Within Chromosomes 

4.1.1 Centromeres 

Centromeres are structural chromosomal features which ensure the faithful segregation 
of chromosomes during meiosis and mitosis (Graur and Li 2000). They do so by 
providing the substrate for spindle fiber attachment which is responsible for the 
poleward migration of chromosomes, a process essential for proper cell division. 
Nonfunctional or surplus centromeres result in cellular failure or inviable gametes 
(Thompson et al. 2010). Although a crucial and ubiquitous feature of chromosomes, 
the centromere has no clear definition from a DNA point of view. The general rule is 
that the region at which centromeric proteins attach during meiosis exhibits at least 
some repetitive DNA sequences, but most studies have shown that functional 
centromeres are rather epigenetically determined and can form at any given location 
on a chromosome (Plohl et al. 2014). 

Fig. 10 Shifted centromere positions between chicken and Japanese quail. Fluorescent in 
situ hybridization was used to compare centromere positions in giant lampbrush 
chromosomes. In both orthologous chromosomes 14 and 15, the position of the 
centromere differs due to de novo centromere formation and/or accumulation of 
heterochromatin. Figure from Zlotina et al. (2012), used with permission from Springer 
Nature 

In chicken, the units of centromere-associated tandem repeat arrays are 
chromosome-specific, but on three chromosomes (5, 27 and Z) the centromere is 
short (~30 kb) and organized in a non-tandem repetitive fashion (Shang et al. 2010). 
Melters et al. (2013) have assessed the most abundant tandem repeats in 282 species 
of plants and animals and argued that those likely constitute centromeric repeats. 
Among the four species of birds investigated in that study (chicken, mallard, zebra 
finch, and gray partridge), repeat unit lengths ranged from 42 to 191 bp and GC 
content from 38 to 56%, suggesting a highly dynamic evolution of centromere DNA 
and very few general rules regarding size or sequence composition. It seems likely 
that heterochromatic tandem repeat arrays also play a role in the overall positioning 
of centromeres. For example, in quail (Fig. 10), despite the absence of evident chromosomal 
inversions compared to chicken, centromere positions are shifted due to the formation 
of new centromeres and subsequent accumulation of centromeric tandem 
repeats (Zlotina et al. 2012). 

It is important to note that there is currently a conspicuous gap between results 
from classical cytogenetic studies (e.g., those mentioned in the previous paragraph) and 
more recent whole-genome sequencing efforts. Although in principle present in 
high-throughput sequencing data sets, centromeric repeats are usually not assembled 
(cf. Fig. 1), and thus the positions of centromeres are not included in genome 
assemblies of non-model organisms. This is not surprising given the highly repetitive 
nature of these regions, but certainly not desirable, since centromeres play a 
fundamental role in genome biology and their positions thus should be incorporated 
in downstream analyses. A promising development is the emergence of new 
technologies which capture long-range information from single DNA 
molecules, e.g., long-read sequencing (Eid et al. 2009) or nanochannel optical 
mapping (Lam et al. 2012). Only recently has it become possible to assemble the 
linear sequence of one entire centromere in humans (Jain et al. 2018), a task which 
has yet to be completed in birds. However, it is at least possible to anchor large 
potentially centromeric tandem repeat arrays into genome assemblies through optical 
mapping (Fig. 11) (Weissensteiner et al. 2017). 

Fig. 11 Anchoring large, presumably heterochromatic tandem repeat arrays in the hooded crow 
genome assembly. Using optical mapping, large-scale tandem repeat arrays (“repetitive anchored 
maps” or RAMs) can be localized in short-read and long-read genome assemblies. Vertical lines in 
boxes denote recognition motifs of a nicking endonuclease which can be visualized as a specific 
“barcode”-like pattern. Note that the tandem repeat array is entirely absent in the short-read 
assembly (dark green) and only partly assembled in the long-read assembly (light green), whereas 
the two optical map assemblies (blue) are capable of visualizing the tandem repeat array to a much 
larger extent. This is because optical mapping uses much longer (>150 kb) single DNA molecules 
as input, thus containing much more long-range information. Figure from Weissensteiner et al. 
(2017), used under CC BY-NC 4.0 license 

4.1.2 Telomeres 

Like centromeres, telomeres represent crucial structural chromosomal features in 
eukaryotic genomes. Their main purpose is to protect ends of chromosomes from 
deleterious alterations via enzymes or to prevent breakage and fusion with other 
chromosomes during replication in mitosis and meiosis (Blackburn et al. 2006). 
While centromeres show diverse organizations with DNA sequences differing even 
between chromosomes of the same species (Plohl et al. 2014), the general structure 
of telomeres is extremely conserved among eukaryotes. The basic feature is a 
microsatellite of six nucleotides “TTAGGG,” which is organized in a tandem repeat 
array with different lengths and associated with specific proteins (TRFs, “telomere 
repeat binding factors”) (Bilaud et al. 1997). A striking feature of the telomere is that 
the length of the array is species-, individual-, chromosome-, and age-specific, 
meaning that it is one of the most dynamic regions of the genome (Monaghan 
2010). Due to the nature of DNA replication, the array gets shortened in each 
replication cycle, leading to those differences in telomere array length depending 
on age. Telomerase, a ribonucleoprotein which carries its own RNA molecule, 
elongates shortened telomere arrays after replication via reverse transcription and 
thus acts against this telomere shortening (Blackburn et al. 2006). Interestingly, the 
rate of the shortening is variable even within a species and also depends on 
environmental conditions. For example, Asghar et al. (2015) found that great reed 
warblers chronically infected with avian malaria have on average shorter telomeres 
and also produce offspring with shorter telomeres in early life. Thus, it seems that 
differences in fitness depending on environmental conditions are mediated by 
differential telomere degradation and elongation. 

Overall, birds have a high variation in telomere array length, with the largest 
individual arrays of any vertebrate found in chicken (up to 2 Mb, Delany et al. 2003; 
Monaghan 2010). Also, the content of telomeric sequence proportional to the entire 
genome is ten times larger in chicken compared to human. The latter might stem 
from the fact that in birds there is abundant occurrence of “TTAGGG” repeats far 
away from chromosome ends, and some (micro)chromosomes even show telomere 
arrays larger than the rest of the chromosome (Nanda et al. 2002). Furthermore, 
mega-telomere arrays are present on four chicken chromosomes (9, 16, 28, and W), 
on three of which they account for an entire chromosome arm (Delany et al. 2007). 
Despite the development of computational tools for estimating telomere array length 
from high-throughput sequencing (Ding et al. 2014; Nersisyan and Arakelyan 2015), 
to date there is no study on an avian system adopting this approach. 
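
To illustrate the general idea behind such read-based approaches, the following minimal sketch counts the fraction of sequencing reads that carry a run of consecutive “TTAGGG” units (or the reverse complement “CCCTAA”). It is not the algorithm of the cited tools: the function names, the `min_copies` threshold of four units, and the plain-string read representation are assumptions chosen for illustration. Converting such a fraction into an array length would additionally require normalization (e.g., by sequencing depth), which is not shown here.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # telomeric repeat unit named in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_telomeric_run(read, unit=TELOMERE_UNIT):
    """Largest number of consecutive copies of `unit` on either strand."""
    read = read.upper()
    best = 0
    for motif in (unit, reverse_complement(unit)):
        # Each match is one uninterrupted run of the motif.
        for match in re.finditer(f"(?:{motif})+", read):
            best = max(best, len(match.group(0)) // len(unit))
    return best


def telomeric_read_fraction(reads, min_copies=4):
    """Fraction of reads with at least `min_copies` consecutive units.

    `min_copies` is a hypothetical threshold, not a published default.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(longest_telomeric_run(r) >= min_copies for r in reads)
    return hits / len(reads)


if __name__ == "__main__":
    # Made-up reads for demonstration only.
    example_reads = [
        "TTAGGG" * 5 + "ACGT",
        "ACGTACGTACGT",
        "AC" + "CCCTAA" * 4,
        "TTAGGGACTTAGGG",
    ]
    print(telomeric_read_fraction(example_reads))
```

Searching both strands matters because reads from the opposite strand of a telomere array carry the motif as “CCCTAA”.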


4.2 Distribution Between Chromosomes 

4.2.1 W Chromosomes 

Birds have Z and W sex chromosomes which evolved from an ancestral pair of 
autosomes (Fridolfsson et al. 1998). The female-specific W chromosome is 
non-recombining except for the pseudoautosomal region (PAR), which is identical 
to that of the Z and pairs and recombines normally during meiosis (Smeds et al. 2014). 
Similar to the XY sex chromosomes of mammals, ZW sex chromosome differentiation 
has been commonly explained via recombination suppression through 
inversions (Charlesworth and Charlesworth 2000) and is generally followed by 
degeneration of the non-recombining sex chromosome (W or Y) through accumulation 
of repetitive elements (reviewed by, e.g., Charlesworth et al. 2005; Mank 2012; 
Abbott et al. 2017). However, although the sex chromosomes of Neognathae 
(Neoaves + Galloanserae; Fig. 3) and tinamous are visibly differentiated 
(“heteromorphic”) (Mank and Ellegren 2007), ratites such as ostrich and emu exhibit largely 
undifferentiated (“homomorphic”) sex chromosomes (Vicoso et al. 2013; Zhou et al. 
2014; Yazdi and Ellegren 2014). Accordingly, genomic approaches suggest that the 
ZW-recombining PAR of ratites spans two thirds of the sex chromosome pair 
(Vicoso et al. 2013; Zhou et al. 2014), in contrast to neognaths and tinamous 
where the PAR is very small (except for the tropic bird, Zhou et al. 2014). For 
example, the PAR is only 630 kb in collared flycatcher and contains only 22 protein-coding 
genes (Smeds et al. 2014). To our knowledge, the smallest avian PAR is that 
of the chicken as it contains no genes but only telomeric and subtelomeric repeats 
(Bellott et al. 2017). 

Why the sex chromosomes of most birds are highly differentiated while those of others 
remain largely undifferentiated is a question yet to be answered (Wright 
et al. 2016). What is known is that sex chromosome differentiation happened in a 
stepwise manner throughout avian evolution, as suggested by the presence of three 
or four evolutionary strata on the Z of neognaths and tinamous (Zhou et al. 2014; 
Suh et al. 2011a; Smeds et al. 2015; Nam and Ellegren 2008) and two evolutionary 
strata on the Z of ratites (Vicoso et al. 2013; Zhou et al. 2014). Consequently, sex 
chromosome differentiation into multiple evolutionary strata happened independently 
not only in neognaths and tinamous (Mank and Ellegren 2007) but also 
within neognaths throughout the diversification of Neoaves and Galloanserae 
(Stiglec et al. 2007; Zhou et al. 2014). This evidence from avian genomics extends 
earlier cytogenetic observations that avian W chromosomes did not gradually 
become smaller over evolutionary time but are rather highly variable in size even 
over short timescales, probably due to repeat expansions or contractions (Rutkowska 
et al. 2012). This is exemplified by W size extremes among closely related 
songbirds. The Z chromosome is generally very stable in size across the avian tree 
of life, and the ZW size ratio is on average 1.9 in birds, with a range from 0.8 in the 
crimson finch Neochmia phaeton to 4.4 in the common tailorbird Orthotomus 
sutorius (Rutkowska et al. 2012). Interestingly, Rutkowska et al. (2012) further 
noted that birds with smaller genomes have smaller W chromosomes, which is in 
line with the observation by Kapusta et al. (2017) that birds with smaller genomes 
have more TE activity and higher genome shrinking at the same time (see Sect. 5.1). 
We therefore suggest that a combination of different degrees of differentiation 
(leading to W shrinkage) and different W repeat expansions (leading to W growth) 
was at play during avian evolution. The exact mechanisms have yet to be 
elucidated, but it is almost certain that repetitive elements played a significant role 
in sex chromosome differentiation. 

As mentioned above (see Sect. 4), the W chromosome is by far the most repetitive 
chicken chromosome (Fig. 9) and thus notoriously difficult to assemble, although 
some other chicken chromosomes remain largely or entirely unassembled (Kapusta 
and Suh 2017; Warren et al. 2017). Analyses of the W repeat content are so far 
limited to two bird species, namely, the long-read chromosome-level assembly of 
chicken (Fig. 9) (Bellott et al. 2017) and the short-read scaffold-level assembly of 
collared flycatcher (Smeds et al. 2015). In both cases, the assembled sequence 
contains TE densities of over 50%, with an over eightfold enrichment in LTR 
retrotransposons compared to the autosomes (cf. Fig. 9). The W chromosome can 
thus be considered as a “refugium” for retrovirus-like elements (Kapusta and Suh 
2017). To our knowledge, chicken and collared flycatcher are the only W assemblies 
so far analyzed with in-depth repeat annotations, because such annotations are so far 
restricted to chicken, zebra finch, and collared flycatcher (reviewed by Kapusta and 
Suh 2017; Suh et al. 2018). Although assemblies of W-chromosomal sequence are 
available for 17 additional bird species (Zhou et al. 2014; Davis et al. 2010), none of 
these have been annotated for lineage-specific repetitive elements, and repeat 
densities would thus likely remain vastly underestimated, especially for LTRs 
(Kapusta and Suh 2017). 

Fig. 12 The complex repetitive structure of the chicken W chromosome. The W contains seven 
cytogenetically distinct regions (“chromomeres”), of which the three euchromatic chromomeres 
(yellow and blue) are mostly assembled by now (cf. Fig. 9) (Bellott et al. 2017). Figure modified 
from Bellott et al. (2017), used with permission from Springer Nature 

What about other types of repetitive elements on the W? In contrast to mammals 
with many gene-containing regions on the Y chromosome which are either tandemly 
repeated with several copies next to each other (“ampliconic”) or present as inverted 
copies (“palindromes”) (Skaletsky et al. 2003; Soh et al. 2014; Bellott et al. 2014; 
Hughes et al. 2010; Bachtrog 2013), the only known ampliconic gene on the avian 
W chromosome is HINTW, with ~40 copies of a 5 kb repeat unit in chicken (Bellott 
et al. 2017) and high between-copy sequence similarity (Backström et al. 2005). 
Further signatures of gene conversion have been detected in a 22 kb palindrome on 
the W chromosome of the white-throated sparrow (Davis et al. 2010). Together, 
these ampliconic and palindromic structures permit gene conversion on an otherwise 
non-recombining chromosome, which leads to decreased degeneration and increased 
potential for adaptation of the genes involved (Betrán et al. 2012). 

To conclude this section, we wondered: How much genomic “dark matter” is 
there on the W? Most suitable for this thought experiment was the chicken W 
chromosome, of which ~7 Mb of euchromatic sequence have been recently assembled 
(Fig. 12) (Bellott et al. 2017). We used RepeatMasker (Smit et al. 1996-2010) 
to annotate this assembly and determined a total repeat density of ~73%. Notably, 
14% of the assembled sequence are CR1 retrotransposons, and 57% are LTR 
retrotransposons. Assuming a chicken ZW size ratio of 2.14 (Rutkowska et al. 
2012) and a Z chromosome size of >82 Mb (Warren et al. 2017), the “true” W 
size should be >38 Mb, which means that >31 Mb (or >82%) of the W are currently 
constituting genomic “dark matter.” Bellott et al. (2017) suggested that large parts of 
the chicken W chromosome are heterochromatic, each containing tandem arrays of 
one of three specific satellite repeats, the 0.5 kb repeat SspI (Itoh and Mizuno 2002), 
the 1.1 kb repeat XhoI, and the 1.2 kb repeat EcoRI (Saitoh and Mizuno 1992; 
Solovei et al. 1998). We therefore extrapolated that the currently missing W 
sequence consists entirely of these satellite repeat arrays and telomeric tandem repeat 
arrays, which would yield a “true” total repeat density of ~95%. More precisely, the 
chicken W chromosome would then consist of ~82% tandem repeats, ~10% LTRs, 
~3% CR1s, and only ~5% non-repetitive sequence. What is the W chromosome, 
really—a sex-specific chromosome or a refugium for repetitive elements? 
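
The arithmetic behind this thought experiment can be retraced in a few lines. The sketch below uses only the figures quoted above (Z size, ZW size ratio, assembled W sequence, and repeat densities of the assembled part) and treats the lower bounds and approximate values as point estimates, which is a simplifying assumption. The function name `extrapolate_w_composition` and the dictionary keys are chosen here for illustration.

```python
def extrapolate_w_composition(z_size_mb=82.0, zw_ratio=2.14, assembled_mb=7.0,
                              repeat_frac=0.73, ltr_frac=0.57, cr1_frac=0.14):
    """Extrapolate the composition of the whole W from its assembled part.

    Assumes, as in the text, that all unassembled W sequence consists of
    tandem repeats. Fractions refer to the assembled sequence only.
    """
    w_size_mb = z_size_mb / zw_ratio          # expected "true" W size
    missing_mb = w_size_mb - assembled_mb     # unassembled "dark matter"
    missing_share = missing_mb / w_size_mb
    assembled_share = assembled_mb / w_size_mb
    return {
        "w_size_mb": w_size_mb,
        "missing_mb": missing_mb,
        "tandem_repeats": missing_share,
        "ltr": ltr_frac * assembled_share,
        "cr1": cr1_frac * assembled_share,
        "total_repeats": missing_share + repeat_frac * assembled_share,
        "non_repetitive": (1.0 - repeat_frac) * assembled_share,
    }


if __name__ == "__main__":
    for name, value in extrapolate_w_composition().items():
        if name.endswith("_mb"):
            print(f"{name}: {value:.1f} Mb")
        else:
            print(f"{name}: {value:.0%}")
```

With the default values, the rounded shares agree with the percentages given in the text (about 82% tandem repeats, 10% LTRs, 3% CR1s, 5% non-repetitive sequence, and 95% total repeats).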


4.2.2 B Chromosomes 

Contrary to the widespread assumption that all cells or all individuals of a species 
contain essentially the same set of chromosomes, it has been known for nearly a 
century that organisms may have “supernumerary” B chromosomes in addition to 
the “standard” A chromosomes (Randolph 1928). These B chromosomes contain 
large amounts of repetitive DNA, are considered to persist as genomic parasites 
through meiotic drive, and have been reported from over 500 animal species 
(reviewed in Camacho 2005; Burt and Trivers 2006). For eukaryotes, Camacho 
(2005) estimated that approximately 15% of all karyotyped species exhibit B 
chromosomes. In the classical definition, B chromosomes are present in only some 
individuals of a species and vary in population frequency over time (Camacho 
2005). There are, however, rare cases where B chromosomes are stably inherited 
within species but consistently eliminated from somatic tissues. These are germline- 
restricted chromosomes (GRCs) (reviewed in Wang and Davis 2014). Such a cell- 
type-specific B chromosome “maintains its evolutionary viability by preserving its 
vertical transmission, but can prevent the most harmful effects from manifesting in 
the host’s soma” (Camacho 2005). 

To date, B chromosomes have been cytogenetically studied in great detail in two 
bird species from the Estrildidae family, zebra finch (Pigozzi and Solari 1998) and 
Bengalese finch (del Priore and Pigozzi 2014). In both cases, the B chromosome is 
maintained in the germline but eliminated from pre-somatic cells during embryogenesis. 
Notably, these GRCs are the only known cases within vertebrates apart from 
those of hagfishes and lampreys, which shared a common ancestor with birds ~550 
million years ago (Smith 2017). The GRCs of zebra finch and Bengalese finch are 
stably inherited through the female germline where they are present in two euchromatic 
and recombining copies and are present in the male germline as a single 
heterochromatic chromosome which is eliminated from spermatocytes during meiosis 
(del Priore and Pigozzi 2014; Itoh et al. 2009; Goday and Pigozzi 2010; 
Schoenmakers et al. 2010; Pigozzi and Solari 2005). It has further been noted that 
the zebra finch GRC is silenced in spermatocytes similar to the W chromosome in 
oocytes (Schoenmakers et al. 2010). Overall, a complex model for GRC transmission 
has emerged through cytogenetics, but some details remain unknown (see 
Fig. 11 of Itoh et al. 2009). 

In addition to their enigmatic mode of transmission, the GRCs of zebra finch and 
Bengalese finch are notable for the fact that they are larger than the largest A 
chromosome of the respective karyotype (del Priore and Pigozzi 2014; Pigozzi and 
Solari 1998) (Fig. 13). Considering that chromosome 2 is the largest chromosome of 
the zebra finch reference genome (Warren et al. 2010), this would suggest that these 
two GRCs are each >156 Mb in size. So far, only two pieces of the zebra finch GRC have 
been sequenced: an intergenic fragment of 18.8 kb, which suggests that at least this part of the 
GRC is derived from chromosome 1 and potentially amplified to high copy number 
(Itoh et al. 2009), and the mRNA of a single GRC-linked gene (Biederman 
et al. 2018). Although it may seem surprising that the current era of avian genomics 
has so far neglected to sequence these GRCs, which are equivalent in size to ~13% of 
the somatic genome (Table 1) (but see recent genome and expression data in Kinsella 
et al. 2018), genomic data for B chromosomes are scarce in general. B chromosome 
assemblies of vertebrates are, to our knowledge, limited to a single cichlid fish 
species (Valente et al. 2014) and zebra finch (Kinsella et al. 2018). Finally, it is 
quite possible that GRCs are widespread among (song)birds but have been 
overlooked because they are not as prominent in size as those of zebra finch and 
Bengalese finch (Fig. 13) (Kinsella et al. 2018; Torgasheva et al. 2018). We therefore 
conclude this section with the words of Smith (2017) on GRCs: “Even for the vast 
majority of species with ‘sequenced’ genomes, we simply haven’t looked.” 

Fig. 13 The germline-restricted chromosome (GRC; arrowhead) of the zebra finch. (a) Partial 
karyotype of the eight largest chromosome pairs in female somatic cells. (b) Partial karyotype of the 
eight largest chromosome pairs in male somatic cells. (c) Partial karyotype of the eight largest 
chromosome pairs and the single GRC in male germline cells. The scale bar is 10 μm. Figure from 
Pigozzi and Solari (1998), used with permission from Springer Nature 


4.2.3 Chromosome 16 

Chromosome 16 seems to be one of the most elusive among the avian chromosomes, 
continuously withstanding assembly efforts. Even in the newest chicken genome 
assembly (galGal5; based on long-read sequencing), only a fraction of the entire 
chromosome is assembled (Warren et al. 2017). Several chromosomal properties— 
with repetitive DNA as a common feature—are the likely reason for that (Fig. 14). 
Firstly, chromosome 16 bears a mega-array of telomeric repeats, occupying the 
shorter arm of the chromosome (Delany et al. 2007). Then there are two domains 
which harbor the major histocompatibility complexes MHC-B and MHC-Y on the 
long chromosome arm. These are thought to be among the genetically most diverse 
regions in the avian genome due to extreme variability in copy number (Hess and 
Edwards 2002). The MHC-B and MHC-Y regions are separated by an array of the 
P041 repeat, a 41-bp-long and (G + C)-rich sequence which also acts as a recombination 
hotspot, rendering the two MHC regions essentially unlinked (Solinhac et al. 
2010). However, the biggest obstacle for sequencing efforts is perhaps the nucleolus 
organizer region (NOR). This chromosomal region is responsible for the initiation of 
the nucleolus, a structure in the cell nucleus responsible for the formation of 
ribosomes and thus of extreme functional importance (Graw 2015). The NOR consists of 
enormous tandem repeat arrays of ribosomal RNA gene copies. Individual copies are 
between 11 and >50 kb in size, while copy numbers range from 279 to 268. Since 
this translates to 5-7 Mb of repeated DNA with high sequence identity among repeat 
copies (Delany and Krupkin 1999), it is not surprising that assembly efforts 
have had limited success so far (Warren et al. 2017). 

Fig. 14 A physical representation of genes and intergenic regions on chicken 
chromosome 16. The MHC (major histocompatibility complex) is split into two 
(MHC-B and MHC-Y) regions. Separated by a large array of GC-rich P041 
repeats, the two MHC regions exhibit elevated recombination between them. 
Note also the large nucleolar organizer region (NOR), which occupies 
approximately 5-7 Mb on the q-arm and contains ~150 copies of ribosomal RNA 
genes. Figure taken from Miller and Taylor (2016), used under CC BY license 

Intriguingly, most of the (known) genes on chromosome 16 seem to be tied to the 
immune system, for example, coding for peptide antigen presentation (MHC-B) or 
the CD1 genes, which encode lipid-antigen-binding molecules (Miller and Taylor
2016). Together with the presence of recombination hotspots increasing allelic 
diversity, chromosome 16 seems to be an ideal substrate for the evolution of 
immune-related genes due to its repetitive nature, and an improved assembly of 
this chromosome would certainly benefit various aspects of avian genomics. 


5 Evolutionary Implications of Genomic "Dark Matter" 

5.1 Accumulation of Repetitive Elements 

While tandem repeats have been largely inaccessible to avian comparative genomics 
as they are mostly unresolved as genomic “dark matter” (see Sect. 3), in-depth 
characterizations of assembled interspersed repeats have revealed lineage-specific 
events of de novo emergences and horizontal transfers (see Sect. 2). Apart from the
patchy distribution of some interspersed repeats across the avian tree of life (Fig. 3)
and assembly issues with very young TEs (Fig. 1), bird genomes vary
considerably in their long-term TE accumulation and TE turnover via deletions
(Fig. 15) (Kapusta et al. 2017; Kapusta and Suh 2017). This may seem counterintuitive
given that not only assembly sizes but also TE copy numbers and densities vary
only moderately in avian genomes (assembly sizes of 1.0-1.3 Gb; 130,000-350,000
TE copies or 4.1-9.8% TE densities, except for the downy woodpecker with a TE
density of 22.2%) (Kapusta and Suh 2017; Zhang et al. 2014), but it is readily
explained by the “accordion model” of genome size maintenance in birds and
mammals (Kapusta et al. 2017). Accordingly, insertion rates and deletion rates
covary in birds, with lineages experiencing more TE accumulation also undergoing 
a higher turnover of TEs via deletions (Fig. 15). The most extreme examples are 
ostrich and penguins with very low rates of TE accumulation and deletion on the one 
end of the spectrum and zebra finch, downy woodpecker, and Anna’s hummingbird 
on the other end (Kapusta et al. 2017). Surprisingly, this implies that secondarily 
flightless birds have larger genomes not due to increased TE accumulation but due to 
reduced shrinking compared to flying birds. Likewise, the smaller genomes of flying 
birds are probably the result of increased shrinking due to higher TE activity and thus 
increased genomic instability (Kapusta and Suh 2017). 

Below, we more specifically address the accumulation of repetitive elements in 
the context of genetic diversity, phylogenetic markers, and large-scale 
rearrangements. 


5.1.1 Genetic Diversity 

Like all types of mutations, changes in tandem repeat array length or insertions of 
interspersed repeats will initially exist as polymorphisms until reaching fixation in a 
population. Notably, genetic diversity in repetitive elements may have a larger 
impact on phenotypic variation than that of single-nucleotide polymorphisms 
(Huddleston and Eichler 2016). While genetic diversity has been quantified in 
several avian species for single-nucleotide polymorphisms (e.g., Dutoit et al. 2017; 
Vijay et al. 2017) and microsatellites (see Sect. 3), the study of population-level 
variation in large tandem repeats and interspersed repeats is still in its infancy in 
avian genomics. Virtually nothing is known about length variation in tandem repeat 
arrays in birds, although methods exist to do so (Wei et al. 2014; Weissensteiner 
et al. 2017). On the other hand, single cases of TE presence/absence polymorphisms 
(TEVs; Fig. 16a) have so far been reported for chicken and grebes, respectively (Lee 
et al. 2017; Suh et al. 2012). A very recent genome-scale study of TEVs across 
200 flycatcher genomes reported over 10,000 TEVs segregating within and between 
the four sampled flycatcher species and suggested recent activity of at least eight
distinct families of retrovirus-like LTR retrotransposons (Suh et al. 2018). We 
emphasize that these estimates of genetic diversity through TEVs are likely 
underestimates, given that the flycatcher reference assembly and re-sequencing 
data are all based on short-read sequencing and thus only allow for detection of 
TEVs in the well-assembled, non-repetitive part of the genome (Fig. 16b).

Fig. 15 The temporal landscape of TE accumulation across the avian tree of life. Dated TE
landscapes are shown as stacked histograms plotted on the Jarvis et al. (2014) phylogeny. The
major groups of TEs are non-LTR retrotransposons (green; mostly LINEs), LTR retrotransposons

Fig. 16 Discovery of TE presence/absence polymorphisms from re-sequencing data. (a) Schematic
illustration of a situation where the “true” genome of a re-sequenced bird individual differs
from the reference genome in the presence of a TE insertion (red) in an orthologous target site
(white square). (b) Read mapping against the reference will result in properly mapped read pairs
(blue) and discordant read pairs (orange) where one of the reads has sequence similarity to a
TE. Figure from Suh et al. (2018), used under CC BY license
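
The read-pair logic of Fig. 16b can be illustrated with a minimal Python sketch. It is not the pipeline of Suh et al. (2018): the ReadPair record, the window size, and the support threshold are hypothetical and chosen only for illustration. Pairs whose mate resembles a TE are treated as discordant and binned along the reference, and bins with enough support are reported as candidate non-reference insertions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadPair:
    """One read pair, reduced to the fields this sketch needs (hypothetical record)."""
    chrom: str               # reference sequence of the anchoring read
    pos: int                 # leftmost reference position of the anchoring read
    mate_te: Optional[str]   # TE family the mate resembles; None for a properly mapped pair


def call_candidate_tevs(pairs, window=500, min_support=3):
    """Bin discordant pairs along the reference and keep well-supported bins.

    Returns sorted tuples (chrom, bin_start, te_family, n_supporting_pairs).
    """
    support = defaultdict(int)
    for pair in pairs:
        if pair.mate_te is None:
            # Properly mapped pair: no evidence for a non-reference TE here.
            continue
        support[(pair.chrom, pair.pos // window, pair.mate_te)] += 1
    return sorted(
        (chrom, bin_index * window, te, n)
        for (chrom, bin_index, te), n in support.items()
        if n >= min_support
    )


if __name__ == "__main__":
    demo = [
        ReadPair("chr1", 1010, "LTR"),
        ReadPair("chr1", 1100, "LTR"),
        ReadPair("chr1", 1250, "LTR"),
        ReadPair("chr1", 1300, None),
        ReadPair("chr2", 50, "LINE"),
    ]
    for call in call_candidate_tevs(demo):
        print(call)
```

Because only reads anchored in uniquely mappable sequence contribute, a sketch of this kind shares the limitation noted above: insertions inside repetitive regions remain invisible to it.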


5.1.2 Phylogenetic Markers 

Rare genomic changes (RGCs) are “complementary markers with enormous poten¬ 
tial for molecular systematics” (Rokas and Holland 2000). The most widely used 
type of RGCs in phylogenetics is presence/absence patterns of retrotransposon 
insertions (cf. Fig. 16a) which have been shown to very rarely exhibit homoplasy 
(reviewed by, e.g., Ray et al. 2006; Shedlock et al. 2004; Han et al. 2011). In contrast
to other types of RGCs, retrotransposon presence/absence patterns can be polarized 
without outgroups because the retrotransposition mechanism yields an insertion 
flanked by a target site duplication as the derived state, while an empty insertion 
site constitutes the ancestral state (Fig. 2a). Concurrent with the availability of more 
and more genome assemblies, studies have shifted from sampling several or dozens 
of retrotransposon markers across species via polymerase chain reaction and Sanger 
sequencing (e.g., Kaiser et al. 2007) to now sampling hundreds or thousands of 
markers from available genome sequencing data (e.g., Suh et al. 2015b; Cloutier
et al. 2018).
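
The polarity argument above lends itself to a small worked example. The Python sketch below is an illustration under simplifying assumptions (no missing data, no incomplete lineage sorting), not a method taken from the cited studies; the taxon labels A-D and marker names are hypothetical. Each marker is coded 1 where the insertion is present (derived) and 0 where the site is empty (ancestral), so the taxa coded 1 form the clade a marker supports. Two homoplasy-free markers must then support nested or disjoint clades.

```python
from itertools import combinations


def supported_clade(marker):
    """Taxa carrying the insertion (derived state, coded 1)."""
    return frozenset(taxon for taxon, state in marker.items() if state == 1)


def compatible(marker_a, marker_b):
    """True if the two supported clades are nested or disjoint."""
    a, b = supported_clade(marker_a), supported_clade(marker_b)
    return a <= b or b <= a or a.isdisjoint(b)


def conflicts(markers):
    """Names of all marker pairs whose supported clades overlap without nesting."""
    return [
        (name_a, name_b)
        for (name_a, m_a), (name_b, m_b) in combinations(markers.items(), 2)
        if not compatible(m_a, m_b)
    ]


if __name__ == "__main__":
    markers = {
        "m1": {"A": 1, "B": 1, "C": 0, "D": 0},  # supports clade (A, B)
        "m2": {"A": 1, "B": 1, "C": 1, "D": 0},  # supports clade (A, B, C)
        "m3": {"A": 0, "B": 1, "C": 1, "D": 0},  # supports clade (B, C)
    }
    print(conflicts(markers))
```

In this toy data set, m1 and m2 are nested and therefore congruent, whereas m1 and m3 conflict; in real data such conflicts point to homoplasy or, more commonly, to incomplete lineage sorting.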

Retrotransposon markers (sampling CR1s, LTRs, or SINEs) have been applied to
a wide range of birds ranging from Galliformes (Kaiser et al. 2007; Kriegs et al. 
2007; Liu et al. 2012b) over Passeriformes (Treplin and Tiedemann 2007; Suh et al. 
2017, 2018) to other groups of birds such as grebes, penguins, storks, and 
woodpeckers (Suh et al. 2012; Watanabe et al. 2006; Kuramoto et al. 2015; Han 
et al. 2011). These studies were largely congruent with sequence-based phylogenies. 
Furthermore, retrotransposon markers have helped settle long-standing phylogenetic 
disputes such as the paraphyly of ratites within Palaeognathae (Baker et al. 2014; 
Haddrath and Baker 2012; Cloutier et al. 2018) and several critical branches within 
the explosive radiation of Neoaves (Suh et al. 2011b, 2015b; Matzke et al. 2012),
such as the sister group relationship between parrots and passerines. Notably, a
reanalysis of genome-level retrotransposon markers suggested that most of the
neoavian radiation is now reliably resolved (congruent with the genome-level
sequence phylogeny of Jarvis et al. 2014), except for the eight deepest speciation
events which remain unresolved (Fig. 17) (Suh 2016; but see Braun et al. 2019).

Fig. 15 (continued) (purple), DNA transposons (blue), and other interspersed repeats (black;
mostly unclassified). Species names in gray are those with low-quality genome assemblies.
Figure modified from Kapusta and Suh (2017), used with permission from John Wiley and Sons

Fig. 17 Retrotransposon markers suggest that the root of the neoavian radiation remains
unresolved (cf. Fig. 3). Comparison of a pruned neighbor network of retrotransposon markers (left) with
a neighbor network of a simulated hard polytomy, i.e., simultaneous speciation (right). Nine
reciprocally monophyletic taxa emerge from this potential hard polytomy (yellow), namely,
Aequornithes/Phaethontimorphae (Ae/Ph), Caprimulgiformes (Ca), Charadriiformes (Ch),
Columbimorphae (Co), Gruiformes (Gr), Opisthocomiformes (Op), Mirandornithes (Mi),
Otidimorphae (Ot), and Telluraves (Te). Figure modified from Suh (2016), used under CC BY
4.0 license

5.1.3 Large-Scale Rearrangements 

Repetitive elements can mediate large-scale rearrangements such as inversions, 
deletions, duplications, and translocations (Konkel and Batzer 2010; Devos et al. 
2002). This happens through ectopic recombination between homologous sequences 
(cf. Fig. 6b) and can thus be expected to be more likely to occur between repeat-rich 
regions of bird genomes (cf. Fig. 9). Additionally, rapid amplification of specific TE 
families can provide genome-wide interspersed substrates for ectopic recombination 
and thus increased genomic instability (reviewed by, e.g., Kapusta et al. 2017). In 
line with this, genomic studies of deep evolutionary timescales of avian chromosome 
evolution have shown that breakpoints of large-scale rearrangements are biased 
toward regions enriched in repetitive elements, especially TEs (Farre et al. 2016; 
Skinner and Griffin 2012). Furthermore, high numbers of large-scale 
rearrangements, in particular inversions, have been reported from the zebra finch 
and other Estrildidae (Hooper and Price 2015; Romanov et al. 2014; Warren et al. 
2010), which coincides with a higher rate of expansion and diversification of
retrovirus-like LTR retrotransposons than in other songbird lineages (Kapusta and
Suh 2017; Suh et al. 2018).

On recent evolutionary timescales, inversions are frequently invoked to explain 
patterns of linkage disequilibrium in population genomics (e.g., zebra finches and 
flycatchers, Knief et al. 2016; Ellegren et al. 2012). However, physical (i.e., 
assembly-based) evidence for intraspecific inversions is rare in avian genomics. To 
our knowledge, the only cases where inversion polymorphisms have been assembled 
or at least their breakpoints determined are the ruff (Küpper et al. 2016;
Lamichhaney et al. 2016) and the white-throated sparrow (Davis et al. 2011; Tuttle
et al. 2016). Both cases are intraspecific megabase-scale inversions with considerable
phenotypic effects on plumage and behavior, and were thus subject to large
targeted efforts of sequencing and assembly. It is quite possible that inversion
polymorphisms are far more common in birds but currently constitute genomic
“dark matter” because their repeat-rich breakpoints are difficult to assemble without
targeted efforts and with short-read sequencing data alone.

The same might apply to another type of large-scale rearrangement, the duplication
of sequence segments or even entire chromosome arms. Duplications are also
known to potentially cause dramatic alterations of the phenotype and thus constitute
a genomic substrate for evolution (Ohno 1970). To our knowledge, the only
well-understood case of a duplication event in birds with direct phenotypic consequences
is the duplex comb locus in chicken. A 20 kb duplication 200 kb upstream of the
EOMES gene causes the V-shaped and buttercup comb phenotypes of certain 
chicken breeds (Dorshorst et al. 2015). 


5.2 Genomic Conflict and Speciation

5.2.1 Host-Parasite Arms Races

“From the perspective of genomic parasites, such as transposons and viruses, 
genomes of cellular organisms are not simply strings of DNA but complex 
microcosms of interactions between and among parasitic genes and host genes” 
(Kapusta and Suh 2017). 

The genome is in an everlasting arms race with the parasitic genes hidden within. 
Recent studies of human evolution highlight this importance, as host-virus arms 
races caused nearly a third of all adaptive protein changes (Enard et al. 2016) and 
were responsible for adaptive introgressions between Neanderthals and modern 
humans (Enard and Petrov 2017). Furthermore, the significance of host-TE arms 
races is becoming widely appreciated, with dozens of TE repressor mechanisms so 
far known from model organisms (reviewed by Kapusta and Suh 2017; Goodier 
2016). Accordingly, host defense can occur “before TE transcription (DNA methylation,
histone modifications, DNA hypermutation), during TE transcription (premature
termination), after TE transcription (RNA hypermutation, piRNA binding,
siRNA binding), and before TE transposition (inhibition of ribonucleoprotein
complexes)” (Kapusta and Suh 2017).

Fig. 18 Host-TE arms races may directly influence speciation. Illustration of the model of Rogers
(2015) where reduced hybrid fitness results from transposon derepression in some gametes following
meiotic recombination of incompatible TE repressor systems. Figure loosely drawn after Rogers
(2015)


To our knowledge, the evolutionary significance of arms races between birds and
transposons/viruses has been overlooked, since only a handful of studies, covering three
repression mechanisms, have investigated them in birds. Firstly, a study on DNA methylation of TEs in
great tits suggested that non-CpG methylation is responsible for silencing of most 
TE copies (Derks et al. 2016), but it has been noted that an annotation of TEs specific 
to the great tit lineage is so far lacking (Kapusta and Suh 2017). Secondly, a study on 
APOBEC cytosine deaminases among 123 sampled vertebrates reported that birds 
(especially zebra finch and medium ground finch) had the strongest signals of C-to-U 
hypermutation of LTR retrotransposons, a mechanism leading to defective LTR 
daughter copies (Knisbacher and Levanon 2015). Thirdly, two recent studies on 
PIWI-interacting RNAs (piRNAs) in chicken suggested that the mRNAs of all
germline-transcribed TEs are targeted by specific piRNAs with complementarity to 
prevent (retro)transposition (Chang et al. 2018) and reported a de novo piRNA birth 
which controls avian leukosis virus infections in domestic but not wild chicken 
(Sun et al. 2017). 

It is plausible that the rapidity of host-TE/virus arms races directly influences 
speciation by contributing to reproductive isolation between new species (Fig. 18). 
Accordingly, reduced gene flow between two populations may lead to divergence of 
their TE landscapes and TE repressor systems which, on secondary contact, may 
result in reduced hybrid fitness or even hybrid sterility if the parental TE repressor
systems are incompatible (Rogers 2015; Erwin et al. 2015). Rogers (2015) suggested
a possible scenario leading to TE derepression in the gonads of F1 hybrids, which
would be the unlinking of the components of a TE repressor system during meiotic
recombination and their placement in separate gametes (Fig. 18). To our knowledge, there are no
empirical examples yet for TE-induced reproductive isolation in birds. However,
in Drosophila the divergence of parental piRNA pathways has been linked to TE
derepression in F1 hybrids (Romero-Soriano et al. 2017; Kidwell et al. 1977;
Erwin et al. 2015). Based on these results, forward simulations of a new TE invading
a population suggested that TE repression via small RNAs may evolve de novo
within fewer than 200 generations (Kelleher et al. 2017). We therefore hypothesize
that host-TE/virus arms races played an important role in bird speciation. This may 
be especially the case in songbirds as they were frequently invaded by novel 
retrovirus-like LTR retrotransposons (Warren et al. 2010; Suh et al. 2018) (see 
Sect. 2.1.3), a situation which likely led to rapid divergence of TE/virus repressor 
systems between songbird populations and species. 

5.2.2 Meiotic Drive 

Selection for a higher number of offspring can act on many different levels.
Arguably the most basic level is the DNA sequence itself, the very carrier of heritable
information. Although selection is often seen as a process affecting individuals of a
population, it can also act such that individual genes or
chromosomes compete for a spot in the next generation. This process, termed
“meiotic drive,” commonly exploits the asymmetry of female meiosis (Sandler and
Novitski 1957). Out of the four products of meiosis, only one is transmitted to
the egg; the other three are relegated to the polar bodies and do not contribute to the
next generation. While this is usually a stochastic process in which all four
products are equally likely to be transmitted, mechanisms can evolve that cause the chromosome
carrying the meiotic driver to be preferentially transmitted (Lindholm et al. 2016). A
well-known example of such an ultra-selfish genetic element is the knob system in 
maize (Buckler IV et al. 1999). Blocks of heterochromatic tandem repeats found 
distal from centromeres outcompete chromosomes without such elements. In the 
case of a heterozygote, the carrier chromosome wins the race for being included in 
the egg and thus gains a selective advantage. Sometimes the affinity for the
centromeric motor protein may depend on the length of a certain type of satellite array
(Henikoff et al. 2001), demonstrating the importance of accurate assessment of
repetitive elements. Due to negative effects of centromere drive in male meiosis 
(e.g., chromosomes failing to separate properly during cell division), which lead to 
infertility or inviability, it is expected that other components influencing segregation, 
such as protein binding affinity, evolve rapidly to restore segregation balance (Malik 
and Henikoff 2001). Direct observations of centromere drive are extremely rare 
(Lindholm et al. 2016), and thus far only two studies have reported
segregation distortion in birds. For chicken, Axelsson et al. (2010) reported indirect
evidence for rapid evolution in kinetochore protein genes and identified two
chromosomal regions which exhibit unfair segregation in 197 chicken families. Knief
et al. (2015) found a prezygotic transmission distorter to be active in both males and
females of zebra finches, which suggests that other mechanisms can disproportionately
increase the fitness of a given genetic element. We assume that these cases are just the tip
of the iceberg of actual meiotic drive, as their confirmation requires strong drivers or
very large sample sizes.
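
Why confirmation requires strong drivers or very large sample sizes can be made concrete with a short power calculation. The Python sketch below is our own illustration and not an analysis from the studies cited above; the function names and the choice of an exact two-sided binomial test against 1:1 Mendelian transmission are assumptions. It asks how often a driver transmitted with probability `drive` would be detected among `n` scored offspring of heterozygous parents.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 < p < 1), via log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(p) + (n - k) * log(1 - p))


def detection_power(n, drive, alpha=0.05):
    """Probability that an exact two-sided binomial test rejects 1:1 segregation.

    n     -- number of offspring scored (assumed independent)
    drive -- true transmission probability of the driving allele
    alpha -- significance threshold
    """
    null = [binom_pmf(k, n, 0.5) for k in range(n + 1)]
    total = 0.0
    for k in range(n + 1):
        # Exact p-value: all outcomes at most as likely as k under 1:1 transmission.
        p_value = sum(q for q in null if q <= null[k] * (1 + 1e-9))
        if p_value <= alpha:
            total += binom_pmf(k, n, drive)
    return total


if __name__ == "__main__":
    for n in (100, 1000):
        for drive in (0.55, 0.70):
            print(n, drive, round(detection_power(n, drive), 3))
```

Under these assumptions, a weak driver (55% transmission) is detected only rarely with 100 offspring and becomes reliably detectable only with samples on the order of a thousand, whereas a strong driver (70%) is obvious in much smaller samples.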


5.3 Meiotic Recombination and Gene Conversion 

The meiotic recombination landscape, i.e., the distribution of crossovers on 
chromosomes during meiosis, is usually highly heterogeneous in eukaryotes 
(de Massy 2013). Different factors influence this heterogeneity; among the most
important are chromosome size, gene density, epigenetic marks, and structural
chromosomal features such as centromeres and telomeres. 

The physical rupture of chromosomal DNA during meiosis (i.e., double-strand 
breaks) leads to exchange between homologous chromosomes through homologous
recombination (Graur and Li 2000). This causes a shuffling of allelic combinations
and is an important source of genetic variation (Ellegren and Galtier 2016). However,
if a double-strand break occurs in a repetitive region, ectopic or nonallelic
homologous recombination (NAHR) might occur (George and Alani 2012). Homologous
repetitive sequences, stemming from different chromosomal regions or even
different chromosomes, align to each other and can induce chromosomal
rearrangements via NAHR (cf. Fig. 6b). While sometimes neutral or even beneficial
for the carrier, mutations such as large chromosomal rearrangements or chromosome 
number changes are mostly deleterious and lead to inviable or infertile offspring 
(Lupski and Stankiewicz 2005). It is thus likely that mechanisms evolve which 
prevent such mutations via suppression of recombination in these regions (George 
and Alani 2012). This is usually achieved by heterochromatization, meaning that the
DNA molecule is tightly wrapped around histone proteins so that otherwise
open chromatin is transcriptionally suppressed and also not accessible for
double-strand breaks (Grewal and Jia 2007). However, it seems paradoxical then that
tandem repeat arrays evolve rapidly, showing frequent expansions and contractions.
A solution to this paradox is offered by the model of gene conversion. In this form of
nonreciprocal genetic exchange, a double-strand break in one homologous chromosome
is followed by the invasion and subsequent insertion of an exact copy of the
sequence of the homologous chromosome (see also Sect. 3.2 and Fig. 8b) (Talbert
and Henikoff 2010). This mechanism can explain not only repeat expansions and
contractions but also the emergence of higher-order repeat
structures and the homogenization of copies within and between tandem repeat
arrays.

Evidence for gene conversion in birds is limited to a few examples, mainly due to 
the fact that a high number of unique closely spaced markers are needed to detect 
it. Only with the advent of high-throughput sequencing technologies has this become 
feasible (Backstrom et al. 2005; Weber et al. 2014; Mugal et al. 2015); however,
detection of gene conversion in highly homogeneous tandem repeat arrays is not yet
possible due to the abovementioned requirement.

Direct evidence for the influence of chromosomal features on recombination is
also rare and mainly restricted to recombination suppression by large-scale
rearrangements such as inversions (e.g., white-throated sparrows, Huynh et al.
2011; ruffs, Küpper et al. 2016). However, in crows it has been shown that
large-scale tandem repeat arrays influence population genetic statistics (e.g., low genetic
diversity and low population recombination rate) of nearby genomic regions,
suggesting that these potentially are (peri)centromeric (Weissensteiner et al. 2017).

Finally, certain repetitive elements might also be enriched in regions of high 
recombination rates. In flycatchers, for example, recombination rate analysis based 
on patterns of linkage disequilibrium showed an association between recombination 
hotspots and certain transposable elements (TEs). While the initiation of
double-strand breaks and thus recombination might be directly influenced by TE activity, it
seems that the shared requirement of open chromatin both for transcription and TE 
activity is causing the association of certain TEs with high-recombination regions 
(Kawakami et al. 2017). 


5.4 Histone Modification and Transcription 

5.4.1 Satellite Repeats 

The transcription of DNA is in general restricted to open chromatin (euchromatin), 
which is not condensed during interphase and thus accessible to the transcription 
machinery (Grewal and Jia 2007). However, recently it has become evident that 
some repetitive sequences residing in densely packed heterochromatin are also 
transcribed (Trofimova and Krasikova 2016). Two hypotheses have been put forward
to resolve this: the “read-through” hypothesis states that some tandem repeats
are transcribed simply because they are located in between two protein-coding genes
and transcription continues into the adjacent noncoding region, failing to
terminate at the end of the gene (Varley et al. 1980). Evidence for this mechanism has been found in
lampbrush chromosomes (highly enlarged chromosomes of growing oocytes) of 
newts (Varley et al. 1980; Diaz et al. 1981). An alternative explanation for the 
transcription of satellite DNA has been provided by Deryusheva et al. (2007). By 
analyzing chicken and quail lampbrush chromosomes, they found open reading 
frames (ORFs) for an endogenous retrovirus (ERV) directly adjacent to arrays of 
two 41-bp-long tandem repeats (CNM and P041). They postulated that transcription
is initiated at the LTR promoters of the ERV and reads through into the tandem repeat arrays.

Genomic regions with a high concentration of satellite DNA in the form of
large heterochromatic tandem repeat arrays generally show different histone modifications
than euchromatin (Grewal and Jia 2007). To prevent an excess of double-strand
breaks in these often very fragile regions, histone proteins (around which the DNA
is coiled) are methylated (as opposed to acetylated in euchromatic regions),
resulting in densely packed chromatin (George and Alani 2012). Heterochromatization
is also a defining feature of sex chromosomes, in particular the
avian W chromosome (Stefos and Arrighi 1971) (see Sect. 4.2.1). Also in
germline-restricted chromosomes (see Sect. 4.2.2), methylation of certain histones
is the mechanism providing condensation and transcriptional silencing (Goday and
Pigozzi 2010). However, further research is needed to uncover the exact mechanisms
of heterochromatization and details on the evolutionary dynamics between large 
tandem repeat arrays and histone modification in bird genomes. 

5.4.2 Transposable Elements 

With their ability to disperse copies of themselves throughout the genome, 
transposable elements may insert into genomic contexts where they influence the 
transcription of nearby genes. This can have significant phenotypic effects, as 
recently shown for a textbook example of evolutionary biology, the industrial 
melanism in peppered moths (van’t Hof et al. 2016). The copies of many TEs 
contain their own transcriptional promoters or enhancers (e.g., located within 
SINE heads, solo-LTRs, and LINE 5' UTRs; see Sect. 2) and through these 
structures may influence the expression of nearby genes or even gene regulatory 
networks (reviewed by Slotkin and Martienssen 2007; Chuong et al. 2017). In the 
human genome, thousands of ancient TEs have been co-opted as transcriptional 
regulators (Bejerano et al. 2006; Lowe et al. 2007). Some copies from these same TE 
families are also evolutionarily conserved in birds (Craig et al. 2018), which 
suggests that they have had a gene regulatory function at least since the common 
ancestor of birds and mammals. However, the gene regulatory consequences of TE 
families active during the diversification of birds (Fig. 15) are poorly understood 
(Craig et al. 2018). To our knowledge, the only case of a phenotypic effect of a 
recent TE insertion in birds is the blue eggshell color of some chicken breeds. This 
phenotype is caused by a tissue-specific over-expression of the SLCO1B3 gene due
to a nearby insertion of a retrovirus-like LTR retrotransposon copy (Wang et al. 
2013; Wragg et al. 2013). 

In addition to such TE-induced increases in transcription, TE insertions may 
instead lead to lower transcription levels of nearby genes (Fig. 19). This is due to 
the role of DNA methylation as a mechanism to repress TE activity (see Sect. 5.2.1). 
According to the “sloping shores” model of Grandi et al. (2015), an individual TE 
insertion may decrease nearby gene expression because the hypermethylation of the 
TE may change the methylation status of a nearby promoter (Fig. 19b). Similarly, 
recent evidence suggests that new TE insertions in Drosophila can lead to a spread of 
repressive histone modifications up to 20 kb away from the TE (Lee and Karpen 
2017). However, the “sloping shores” model remains to be tested in birds, because 
knowledge on the methylation status of avian TEs is so far limited to ancient and 
medium-age TE insertions from great tit (Derks et al. 2016) (see Sect. 5.2.1). Recent 
or polymorphic TE insertions might be particularly promising candidates for causing
differential methylation or differential gene expression within and between bird
species, and recently developed methods can now determine the methylation
status of individual TE insertions (Suzuki et al. 2015; Daron and Slotkin 2017).
In this context, it is plausible that TE insertions with large phenotypic effects 
facilitate rapid adaptation (Stapley et al. 2015). We thus emphasize the importance 
of determining not only the phenotypic effects of candidate TEs but also of 
establishing whether or not they are under positive selection. In contrast to SNPs,
the selective forces acting on individual TE insertions have been largely ignored so
far. Recent methodological improvements (reviewed by Villanueva-Cañas et al.
2017), together with long-read sequencing, promise to reveal the evolutionary
significance of TE-induced regulatory variation in birds.

Fig. 19 DNA methylation of transposon insertions may affect nearby gene expression. (a)
Schematic scenario of a TE insertion event close to a transcriptional promoter of a nearby gene.
(b) Illustration of the “sloping shores” model of Grandi et al. (2015) where, following a TE insertion
event, DNA hypermethylation of the TE (blue dashed box) may extend to a nearby promoter
(orange dashed box) and lead to a lower transcription level of the affected gene. Alternatively, DNA
hypomethylation of the TE may lead to a higher transcription level of a nearby gene. Figure loosely
drawn after Grandi et al. (2015)


6 Limitations and Future Avenues to Analyze Genomic "Dark Matter"

Rapid technological advances in the fields of genomics (Peona et al. 2018) and 
cytogenetics (Damas et al. 2017) have made it possible to continuously add layers of 
information to (avian) genomic studies which were previously hidden within geno¬ 
mic “dark matter.” The four most important advances after the introduction of high- 
throughput sequencing are long-read sequencing, linked-read sequencing, optical 
mapping, and chromatin interaction mapping (Eid et al. 2009; Lieberman-Aiden 
et al. 2009; Lam et al. 2012; Weisenfeld et al. 2017). They all share the utilization of 
long-range information from single DNA molecules. Long-read sequencing 
produces unbiased (i.e., without PCR-induced error) sequence reads with mean 
read lengths of routinely up to 20 kb, and while the sequencing error rate is usually 
quite high (-15%), the introduced error is random and can be overcome by increas¬ 
ing the amounts of reads covering any single position in the genome (e.g., Bickhart 
et al. 2017). Assemblies produced with this technology have resulted in unprece¬ 
dented contiguity with over 100-fold increases of contig N50 and thus enormously 
reducing the number of gaps in the assembly (Korlach et al. 2017; Weissensteiner 
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Fig. 20 A model for the resolution of genomic “dark matter” with current sequencing and genome 
mapping technologies. Gray bars indicate genomic “dark matter,” mostly nested within interspersed 
repeats (IRs) and tandem repeats (TRs). Hi-C, chromosome conformation capture; LRC, linked- 
read “cloud”; OM, optical mapping. Figure from Peona et al. (2018), used under CC BY license 

et al. 2017). More precisely and as illustrated in Fig. 20, long-read sequencing 
provides read lengths on the >10 kb level (with individual reads now well beyond 
100 kb, Jain et al. 2018), and we assume that this may resolve assembly gaps within 
most interspersed repeats and at the boundaries of tandem repeat arrays. Linked-read 
sequencing relies on microfluidic barcoding of long DNA molecules in individual 
beads and subsequent sequencing using short-read platforms. In this way, reads can be
assigned to individual molecules, forming linked-read “clouds” that contain long-range
information used in the assembly process (Weisenfeld et al. 2017). Optical
mapping provides linkage information on the >100 kb level and may therefore at 
least anchor tandem repeat arrays into genome assemblies (Weissensteiner et al. 
2017). Chromosome conformation capture (Hi-C) provides linkage information on 
the >1 Mb level and may therefore span large tandem repeat arrays including 
centromeres, yielding scaffolds of the length of individual chromosomes (Bickhart 
et al. 2017; Dudchenko et al. 2017). 
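
Contig N50, the contiguity measure cited above, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The following minimal Python sketch illustrates the calculation. The function name `n50` and the toy contig lengths (in kb) are assumptions made purely for illustration and are not taken from any of the cited assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assemblies with the same total length (360 kb).
fragmented = [80, 70, 60, 50, 40, 30, 20, 10]   # eight short contigs
contiguous = [150, 130, 80]                     # the same sequence in three contigs

print(n50(fragmented))  # 60
print(n50(contiguous))  # 130
```

In this toy case, closing gaps so that eight contigs become three raises N50 from 60 to 130 kb although the total assembly length is unchanged, which is why N50 is a convenient (if incomplete) summary of contiguity.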

At the moment, the focus lies on improving the long-range information in 
genome assemblies (as measured in contig or scaffold N50) because it is easily 
measurable. However, the main potential of resolving genomic “dark matter” lies in
the ability to characterize the full spectrum of genetic variation occurring in eukaryotic
genomes (Huddleston and Eichler 2016; Sedlazeck et al. 2018). Structural
changes (e.g., chromosomal rearrangements, tandem repeat dynamics,
retrotransposition), rather than single-nucleotide polymorphisms, account for
the majority of differences in the DNA sequence
between two human individuals (Chaisson et al. 2015). Yet it remains unclear how
much of this uncharacterized variation has direct relevance for the phenotype of an
organism. A prime avian example for massive functional genetic variation hidden in 
obscure parts of the genome is the major histocompatibility complex (MHC) on 
chromosome 16, which plays an important role in immune function in vertebrates. 
This gene family is notoriously difficult to assemble in its own right, and
in addition most of chromosome 16 is so repetitive that the vast majority of it
remains inaccessible even in the recent chicken genome assembly based on long-read
sequencing (Warren et al. 2017) (see Sect. 4.2.3). Thus, even functionally important
regions can reside within repetitive regions that are difficult to sequence and
assemble. Studies on the genetic mechanisms underlying speciation are also likely
to miss important information when looking for the targets of selection. Until
recently, the focus lay solely on annotated protein-coding genes (Ellegren et al.
2012; Poelstra et al. 2014), largely ignoring the role of undetected promoters and
regulatory elements. The latter might include repetitive DNA if they emerged 
through co-option of transposable elements (Chuong et al. 2017). Furthermore, it 
is important to keep in mind that, for example, genetic elements that distort meiotic 
segregation (e.g., centromere drivers) may reside in repeat-rich heterochromatin in 
or near centromeres and are thus difficult to detect and assemble. 

Altogether, ornithologists now have a diverse toolkit to tackle the most challenging
tasks of avian genomics (Fig. 20), and we hope that the importance of resolving
the “dark matter” of bird genomes will become more widely appreciated.

Abstract

Reconstructing relationships among extant birds (Neornithes) has been one of the
most difficult problems in phylogenetics, and, despite intensive effort, the avian
tree of life remains (at least partially) unresolved. Thus far, the most difficult
problem is the relationship among the orders of Neoaves, the major clade that
includes most (~95%) named bird species. This clade appears to have
undergone a rapid radiation near the end-Cretaceous mass extinction (the K-Pg
boundary). On the other hand, if one embraces a “glass half full” view, the fact
that most orders in Neoaves can be placed into seven clades, recently designated
the “magnificent seven,” could be viewed as remarkable progress. We propose
that the dawning era of whole-genome phylogenetics will resolve the
remaining relationships only if we improve data quality, exploit information from
other sources (i.e., rare genomic changes), and learn more about the functional
and evolutionary landscape of avian genomes. Of course, it is possible that the
remaining unresolved relationships are unresolvable regardless of the data available,
but we suggest that the community should avoid this conclusion until more
data collection has been completed and improved analyses have been conducted.


Author contributed equally with all other contributors. Edward L. Braun, Joel Cracraft and Peter 
Houde 


E. L. Braun 

Department of Biology and Genetics Institute, University of Florida, Gainesville, FL, USA 
e-mail: ebraun68@ufl.edu 

J. Cracraft 

Department of Ornithology, American Museum of Natural History, New York, NY, USA 
e-mail: jlc@amnh.org 

P. Houde (✉)

Department of Biology, New Mexico State University, Las Cruces, NM, USA 
e-mail: phoude@nmsu.edu 


© Springer Nature Switzerland AG 2019 

R. H. S. Kraus (ed.), Avian Genomics in Ecology and Evolution , 

https://doi.org/10.1007/978-3-030-16477-5_6 




We say this because there is ample evidence that estimates of avian phylogeny 
based on large-scale datasets may be affected by well-characterized artifacts (e.g., 
long-branch attraction, heterotachy, and discordance among gene trees) and by 
subtle “data-type effects” that reflect poor fit to empirical data for available 
models of sequence evolution. Even if these analytical challenges can be 
addressed, we need to integrate phylogenomic and fossil data. Finally, we also 
emphasize that, regardless of the resolution (or lack thereof) for relationships 
among major avian clades, we are only at the dawn of the phylogenomics of birds. 
Large-scale molecular data remain unavailable for the vast majority of the
~10,000 named bird species, and those named bird species probably represent
an underestimate of the true number of distinct evolutionary lineages of birds 
(whether or not those lineages are assigned the rank of species) by as much as 
threefold. A true biodiversity genomics effort in birds is likely to reveal many 
additional examples of cases where it is very difficult to resolve relationships; the 
effort to resolve as many of those relationships as possible will represent a major 
scientific achievement and provide lessons for phylogenomic studies in other 
parts of the tree of life. 


Keywords 

Bird phylogeny • Phylogenetic estimation • Base composition • GC-content • 
Heterotachy • Rare genomic changes • Multispecies coalescent • Whole-genome 
sequencing • Phylogenomics 


1 Introduction 

A renaissance within systematics began in the late 1980s with the introduction of the 
polymerase chain reaction (PCR), which provided a simple method for directly 
harvesting and sequencing DNA for comparative studies (Higuchi and Ochman 
1989; Kocher et al. 1989; Saiki et al. 1988). This had an almost immediate effect 
within systematic biology in general (Hillis and Moritz 1990; Miyamoto and 
Cracraft 1991), as well as in ornithology in particular (Edwards et al. 1991; Edwards 
and Wilson 1990; Helm-Bychowski and Cracraft 1993; Mindell 1997; also see 
Sheldon and Bledsoe 1993 for a review describing the early history of molecular 
systematics in birds). The human genome project began during the same period 
(Cantor 1990; Watson 1990). This led to remarkable advances in DNA sequencing 
technologies and methods for the storage and analysis of sequence data that continue 
to have a profound impact on comparative biology, including the field of systematics 
(for additional historical details, see Wink 2019). These technical achievements 
resulted in DNA sequence comparative datasets of ever increasing sizes. Thus, the 
next decade saw the widespread expansion of the use of DNA sequence data in avian 
systematics, and knowledge about the avian tree of life deepened at all taxonomic 
levels (reviewed by Cracraft et al. 2004). In particular, studies of avian relationships
were characterized by increased taxon sampling and the inclusion of multiple
mitochondrial regions and nuclear loci. 

The idea of “phylogenomics” has existed for about two decades. Although the 
original usage included the inference of gene function using evolutionary history 
(Eisen 1998; Eisen et al. 1997), the term has largely become synonymous with 
the use of large amounts of sequence data in phylogenetics (Delsuc et al. 2005). 
And today, there are many thousands of citations alluding to “phylogenomics,” and 
the majority of these imply the use of “genomic” or “large-scale” approaches to 
estimate the tree of life. In some sense, the phylogenomic era of avian systematics 
began with Hackett et al. (2008), in which the “Early Bird” consortium of 
investigators employed 19 loci (~32 kb of sequence) from 169 species to construct
a phylogeny that included all major avian lineages (Table 1). It might be more 
accurate to view the Early Bird effort as the beginning of a “proto-phylogenomic 
era” of avian systematics because it represented a significant increase in scale of data 
collection relative to previous work but it still relied on PCR for gene sampling. 
However, the use of PCR sampling was not a major limitation, and, at the time, 
Hackett et al. (2008) provided the broadest support for relationships within and 
(to some degree) among avian orders. 


Table 1 Recent large-scale estimates of avian phylogenyᵃ

Study                           Number of neornithine taxa   Data typeᵇ                     Analysisᶜ   Branch lengthsᵈ
Fain and Houde (2004)           149                          one nuclear locus (nc)         MP/ML/BI    n/a
Ericson et al. (2006)           111                          five nuclear loci (c/nc)       BI          time
Livezey and Zusi (2007a)        150                          morphologyᵉ                    MP          n/a
Hackett et al. (2008)           169                          19 nuclear loci (nc)           ML          mol
Kimball et al. (2013)           77                           50 nuclear loci (nc)           ML          mol
McCormack et al. (2013)         33                           1541 nuclear loci (nc)         BI          mol
Jarvis et al. (2014)            48                           12,020 nuclear loci (nc/c)     ML          time/mol
Prum et al. (2015)              198                          259 nuclear loci (c)           BI          time
Claramunt and Cracraft (2015)   48/230ᶠ                      up to 1156 nuclear loci (c)    BI          time
Reddy et al. (2017)             235                          54 nuclear loci (nc)           ML          mol

ᵃ We define large-scale trees as those that include most or all of the orders, as defined by Cracraft
(2013), in Neoaves
ᵇ Data types are reported as “c” for primarily coding and “nc” for primarily noncoding (introns,
UCEs, and untranslated regions)
ᶜ BI Bayesian inference, ML maximum likelihood, MP maximum parsimony
ᵈ Branch lengths available as estimates of absolute time or molecular change (substitutions per site).
“n/a” indicates that branch lengths are not available
ᵉ Livezey and Zusi (2007b) provide a detailed description of this morphological data; Mayr (2008)
discussed issues with character scoring and the interpretations of avian morphological variation that
are more congruent with molecular phylogenies
ᶠ The 48-taxon tree in Claramunt and Cracraft (2015) reflects an analysis of 1156 clocklike coding
regions. The 230-taxon tree reflects an analysis of two coding regions. Both analyses were
constrained to the Jarvis et al. (2014) backbone. Additional analyses using the Prum et al. (2015)
backbone are reported in their supplementary material

This achievement was soon surpassed by studies that adopted various next-
generation sequencing technologies (Glenn 2011). Arguably, the new and innovative
methods for sequence capture represent the most important technology for avian
phylogenomics at this time. Those methods greatly expanded the harvesting of 
hundreds or thousands of loci across the genome (Table 2), truly launching the era 
of avian phylogenomics. Currently, one of the more popular sequence capture 
methods is the use of probes that hybridize to ultraconserved elements (UCEs), 
which correspond largely to noncoding regions that are conserved across most or all 
vertebrates (Bejerano et al. 2004). Many noncoding UCEs appear to be involved in 
gene regulation (Dimitrieva and Bucher 2012), but their function is not related to 
their use in phylogenetics. Instead, the conserved nature of the sequences allows 
probes that hybridize to UCEs to be used for sequence capture in many different 
vertebrates; most or all of the phylogenetic information in UCE datasets actually 
reflects the less-conserved sequences that flank the conserved core of the UCE 
(Crawford et al. 2012; Faircloth et al. 2012; McCormack et al. 2012, 2013). 
Analyses of UCEs have begun to advance our understanding of avian systematics 


Table 2 Published phylogenomic studies using the UCE sequence capture data

Study                        Focal order        Number of species   Number of samplesᵃ
Smith et al. (2014)          Passeriformes      5                   36
Sun et al. (2014)            Galliformes        15                  —
Bryson et al. (2016)         Passeriformes      30                  —
Hosner et al. (2016)         Galliformes        23                  —
Hosner et al. (2015b)        Galliformes        90                  —
Manthey et al. (2016)        Passeriformes      11                  28
McCormack et al. (2016)      Passeriformes      1 (3)ᵇ              27
Meiklejohn et al. (2016)     Galliformes        18                  —
Moyle et al. (2016)          Passeriformes      106                 —
Persons et al. (2016)        Galliformes        11                  —
Zarza et al. (2016)          Passeriformes      3                   26
Andersen et al. (2017)       Coraciiformes      21                  —
Bruxaux et al. (2017)ᶜ       Columbiformes      6                   21
Campillo et al. (2017)       Passeriformes      17                  —
Hosner et al. (2017)         Galliformes        115                 —
Wang et al. (2017)           Galliformes        20                  —
White et al. (2017)          Caprimulgiformes   12                  —
Andermann et al. (2018)      Caprimulgiformes   2                   9
Musher and Cracraft (2018)   Passeriformes      29                  62
Younger et al. (2018)        Passeriformes      3                   23

ᵃ The total number of samples is listed if multiple individuals were sequenced for at least one of the
focal species; a dash indicates that only one sample per species was sequenced
ᵇ McCormack et al. (2016) sequenced 27 Western scrub jays (Aphelocoma californica) from
3 lineages that could represent species
ᶜ Bruxaux et al. (2017) used genome skimming followed by bioinformatic extraction of UCE loci
(and other loci) rather than sequence capture




























Resolving the Avian Tree of Life from Top to Bottom: The Promise and... 




at all levels, from the deepest branches (e.g., McCormack et al. 2013; Gilbert et al. 
2018) to the tips of the tree (Table 2). UCEs even appear to be useful at 
phylogeographic scales (Harvey et al. 2016; Smith et al. 2014). Lemmon et al. 
(2012) developed a similar approach based on a distinct probe set (one focused 
largely on coding exons) that they called anchored hybrid enrichment; like UCE 
sequence capture, anchored hybrid enrichment is capable of harvesting vast 
quantities of data, and it is also being applied in avian systematics (Prum et al. 2015). 

At present, the only higher-level phylogenetic study of birds that has employed 
whole genomes (more accurately, draft genome sequences) is that of Jarvis et al. 
(2014). However, analyses of draft genome sequences are increasingly informing 
avian evolutionary studies (Cornetti et al. 2015; Lamichhaney et al. 2015;
Nadachowska-Brzyska et al. 2015; Nater et al. 2015; Poelstra et al. 2014; Toews 
et al. 2016a; Tuttle et al. 2016; Ottenburghs et al. 2017a; Stryjewski and Sorenson 
2017; Tiley et al. 2018; for recent general reviews of avian evolutionary genomics, 
see Joseph and Buchanan 2015; Kraus and Wink 2015; Toews et al. 2016b). Indeed, 
there are currently major efforts in multiple labs to increase the number of avian 
genome sequences (like the B10K project; Zhang et al. 2015), and within just a few 
years, the amount of comparative data available for phylogenetic and evolutionary 
studies in birds will expand exponentially. Yet, although the last decade has seen a 
great improvement in our understanding of avian relationships, these large-scale data 
have also revealed remarkable incongruence in avian relationships due to using 
different genes, distinct data types (e.g., across exons, introns, and UCEs), and 
various taxon and character sampling regimes (e.g., Jarvis et al. 2014; Hosner 
et al. 2015b, 2016; Ottenburghs et al. 2016a; Reddy et al. 2017). Thus, far from 
heralding the “end of incongruence,” as Gee (2003) asserted, phylogenomics has 
actually revealed the complex nature of the phylogenetic signals in genomes (e.g., 
Jarvis et al. 2014). This raises several questions: Why do some nodes in 
phylogenomic trees have limited support even when large amounts of data are 
analyzed? What does it mean to use “whole genomes” in phylogenetics? What are 
the limitations to “whole-genome” analysis? If we shift our focus toward avian 
phylogenomics more specifically, there are several additional questions that emerge: 
What are the most problematic relationships in the bird tree based on the data and 
analyses that are currently available? And finally, what are likely to be the best ways 
forward to resolve many of the very difficult problems that still exist across the avian 
tree? 

In this chapter we seek to address these questions by examining several fundamental
issues relevant to the theory and practice of phylogenetics, using the evolution
of birds as a “model system” to understand phylogenomic methods. We focus
exclusively on extant birds (Neornithes) and refer readers to recent reviews (Brusatte 
et al. 2015; Wang and Zhou 2017) for discussions of Mesozoic birds. We also avoid 
discussion of non-avian dinosaurs, although we would like to point interested 
readers to the chapter on “dinosaur genomics” by Griffin et al. (2019). This chapter 
begins by highlighting the progress that has been made (or has not been made) in 
elucidating the avian tree of life, largely focusing on higher-level taxonomic 
relationships. Throughout, we have used common names (unless their use is
unwieldy) in order to make this chapter more accessible to readers working outside
of avian systematics (the scientific names associated with those common names are
listed in Table 3). Then we discuss the lessons of current genome-scale efforts to estimate the
bird tree, focusing on analytical challenges, such as the impact of “data-type effects” 
(Reddy et al. 2017) and the computational challenges (e.g., the need to use more than 
400 years of CPU time to analyze 48 bird genomes; see Jarvis et al. 2014). We place 
this review of avian phylogenomics in a broader context, discussing the potential for 
phylogenomics to have an impact on the fields of systematics, paleontology, genomics,
molecular biology, evolutionary developmental biology (“evo-devo”), and
biodiversity studies more generally. We also discuss the potential for those fields 
to influence avian phylogenomics. 

We have sometimes been deliberately provocative in this review, making it our 
goal to summarize hypotheses embraced by many in the avian systematics community
and to play devil’s advocate regarding those hypotheses. We believe that this
will stimulate integrative studies to answer the many remaining questions in avian 
phylogeny. Finally, we take a look forward and discuss the potential impact of 
very high-quality (“platinum” or “reference” quality) genome assemblies generated 
using third- and fourth-generation sequencing technologies (for recent reviews of 
sequencing technologies, see Bleidorn 2016; Feng et al. 2015; Korlach et al. 2017).
We certainly expect platinum-quality genome assemblies to have a major impact 
on phylogenomics, especially when those high-quality data are combined with 
improved analytical methods. However, the phylogenomic data available at this 
time (e.g., sequence capture and draft genome assemblies) have already enabled the 
community to solve phylogenetic problems that have long been thought to be 
intractable; we expect this trend to continue. 


2 What We Do (And Do Not) Know About the Avian Tree 

Systematic studies during the twenty-first century rapidly affirmed the 
non-monophyly of many traditionally recognized orders, notably Gruiformes, 
Ciconiiformes, Pelecaniformes, and Falconiformes. All of these figured prominently 
in classifications for more than a century, and their ordinal names have been retained 
in modern taxonomies (Table 3). However, the results of analyses using molecular 
data required subsuming some traditional orders into more inclusive ones (e.g., 
Ciconiiformes within Pelecaniformes, Apodiformes within Caprimulgiformes) or 
defining other orders more narrowly (e.g., Gruiformes and Falconiformes). Likewise,
some largely abandoned names have been resurrected (e.g., Accipitriformes)
and some families have been reassigned to ordinal rank (e.g., Mesitornithiformes
and Eurypygiformes). Efforts based on many genes, including some early
phylogenomic efforts, have also begun to resolve superordinal clades with some 
degree of confidence (see below). In some cases, these studies have revealed 
counterintuitive relationships, such as the sister relationship of grebes and flamingos
and the placement of tropicbirds as sister to Eurypygiformes (Sunbittern
and Kagu).

Below we showcase what is currently known with reasonable certainty about the
relationships of the supraordinal lineages of birds (summarized as a consensus tree in 
Fig. 1), highlighting the contributions of phylogenomics to our understanding of the 
bird tree. At present, Jarvis et al. (2014) and Prum et al. (2015) are the largest-scale 
estimates of the avian tree. Jarvis et al. (2014) presented many trees, but we largely 
focus on the “total evidence nucleotide tree” (TENT), which was based on a 
maximum likelihood (ML) analysis of a data matrix comprising intronic, exonic, 
and UCE data. Prum et al. (2015) presented a single tree based on a Bayesian 
analysis of a largely exonic dataset. They also provided an ML tree with a virtually 
identical topology but much lower support; in this chapter, we focus on the ML 
bootstrap support because it is more comparable to the support values associated 
with Jarvis et al. (2014) trees. The tree in Fig. 1 is a strict consensus of the Jarvis et al. 
(2014) TENT and the Prum et al. (2015) tree, modified to include information about 
specific taxa that were not sampled by Jarvis et al. (2014). For those taxa that were 
not included in Jarvis et al. (2014), we also considered Reddy et al. (2017), who 
presented ML and Bayesian analyses of a data matrix dominated by noncoding 
sequences with a sample of taxa similar to the Prum et al. (2015) study. We view the 
consensus tree in Fig. 1 as the best corroborated hypothesis for the bird tree, but the 
topology of the bird tree remains far from certain at this time. We emphasize this 
uncertainty by calling attention to the key unsolved problems of higher-level 
relationships. 
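
A strict consensus retains only those clades that appear in every input tree. The sketch below illustrates the idea with a toy pair of rooted trees written as nested tuples. The two topologies are deliberately simplified placeholders (they are not the actual Jarvis et al. 2014 or Prum et al. 2015 trees), and the helper names `clades` and `strict_consensus_clades` are hypothetical.

```python
def clades(tree):
    """Return every clade (as a frozenset of tip names) in a rooted nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):          # a tip
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)                 # record the clade subtended by this node
        return members

    tips(tree)
    return found


def strict_consensus_clades(tree_a, tree_b):
    """Clades present in both trees; all other nodes collapse into polytomies."""
    return clades(tree_a) & clades(tree_b)


# Toy topologies (assumptions for illustration only).
tree_a = (("shorebirds", "cranes"), ("flamingos", "grebes"))
tree_b = ((("flamingos", "grebes"), "shorebirds"), "cranes")

for clade in sorted(strict_consensus_clades(tree_a, tree_b), key=len):
    print(sorted(clade))
```

Here the flamingo and grebe clade survives because both toy trees contain it, whereas the shorebird and crane clade, present in only one tree, is collapsed. This mirrors how a grouping supported by only one of two analyses disappears from their strict consensus.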


2.1 Palaeognathae ("Ratites" and Tinamous) 

One of the most surprising results from the first comprehensive avian phylogeny 
based on multiple nuclear loci (Hackett et al. 2008) was the failure to support 
monophyly of ratites (the large, flightless paleognaths such as the ostrich and 
emu). Indeed, Hackett et al. (2008) had 100% bootstrap support for a node 
contradicting ratite monophyly, placing ostriches sister to a clade comprising the 
other ratites and the volant tinamous. Ratites had long been regarded as a textbook 
exemplar (e.g., Bergstrom and Dugatkin 2012; Futuyma 2005; Stearns and Hoekstra
2005) of Gondwana vicariance following a single loss of flight in their common 
ancestor (Cracraft 1973, 1974). The widespread occurrence of volant Paleogene 
paleognaths (lithornithids; Houde 1986, 1988) certainly raised questions regarding 
the prevailing paradigm that the distribution of ratites reflects a single loss of flight 
and vicariance due to the breakup of Gondwana in the Cretaceous. Likewise, 
analyses of complete mitochondrial genomes conducted shortly before Hackett 
et al. (2008) reported, at best, equivocal support for ratite monophyly (Braun and 
Kimball 2002; Slack et al. 2007) and analyses of some nuclear loci conflicted with 
ratite monophyly (MYC in Cracraft et al. 2004; GH1 in Yuri et al. 2008; combined 
CLTC and CLTCL1 in Chojnowski et al. 2008). However, an explicit hypothesis of 
ratite non-monophyly based on broad sampling of the genome was not advanced 
until Hackett et al. (2008). Ratite non-monophyly does not, in and of itself, falsify 
the Gondwana biogeography hypothesis. However, it does raise the possibility

[Fig. 1 tip labels: common name, order]
Ostrich                  Struthioniformes
Rheas                    Rheiformes
Emu & Cassowaries        Casuariiformes
Kiwis                    Apterygiformes
Elephant birds           Aepyornithiformes
Moas                     Dinornithiformes
Tinamous                 Tinamiformes
Waterfowl                Anseriformes
Landfowl                 Galliformes
Flamingos                Phoenicopteriformes
Grebes                   Podicipediformes
Doves                    Columbiformes
Sandgrouse               Pterocliformes
Mesites                  Mesitornithiformes
Shorebirds               Charadriiformes
Cranes, Rails            Gruiformes
Hoatzin                  Opisthocomiformes
Oilbird                  Caprimulgiformes
Potoos                   Caprimulgiformes
Nightjars                Caprimulgiformes
Frogmouths               Caprimulgiformes
Owlet-nightjars          Caprimulgiformes
Swifts                   Caprimulgiformes
Hummingbirds             Caprimulgiformes
Turacos                  Musophagiformes
Bustards                 Otidiformes
Cuckoos                  Cuculiformes
Tropicbirds              Phaethontiformes
Sunbittern (& Kagu)      Eurypygiformes
Loons                    Gaviiformes
Pelicans & allies        Pelecaniformes
Tubenoses                Procellariiformes
Penguins                 Sphenisciformes
New World vultures       Accipitriformes
Eagles, Hawks            Accipitriformes
Owls                     Strigiformes
Mousebirds               Coliiformes
Cuckoo-roller            Leptosomiformes
Trogons                  Trogoniformes
Hornbills & allies       Bucerotiformes
Rollers & allies         Coraciiformes
Woodpeckers & allies     Piciformes
Seriemas                 Cariamiformes
Falcons                  Falconiformes
Parrots                  Psittaciformes
Passerines               Passeriformes

Clade labels: Palaeognathae; Galloanseres; Neoaves (Columbea); Neoaves (Passerea)


Fig. 1 Consensus phylogenomic tree of birds. This backbone is a strict consensus of the Jarvis 
et al. (2014) “total evidence nucleotide tree” (TENT) and the Prum et al. (2015) tree. The division 
of Neoaves into Columbea and Passerea is based on Jarvis et al. (2014), although the division 
is not presented in the consensus tree because it is not present in the Prum et al. (2015) tree. 
The numbered clades correspond to the “magnificent seven” of Reddy et al. (2017): (1) “core” 
landbirds (Telluraves); (2) “core” waterbirds (Aequornithes); (3) tropicbirds and Sunbittern

of dispersal by a volant ancestor followed by independent losses of flight. The
initial suggestion of ratite non-monophyly was followed by a detailed reanalysis of
the Early Bird data focused on determining whether the support for ratite 
non-monophyly reflects misleading phylogenetic signal (Harshman et al. 2008); 
that study did not reveal any sources of bias (for a detailed discussion of bias in 
phylogenetic analyses, see below, in Sect. 3). Subsequent molecular studies, including
reanalyses of mitochondrial DNA (Phillips et al. 2010), the addition of nuclear
genes (Baker et al. 2014; Haddrath and Baker 2012; Smith et al. 2013), analyses of 
transposable element (TE) insertions (Baker et al. 2014; Haddrath and Baker 2012), 
and whole-genome analyses (Sackton et al. 2018), have also corroborated ratite 
non-monophyly. However, those studies have shown conflicts regarding the other 
relationships within paleognaths (we emphasize this by presenting the base of 
Palaeognathae except ostriches as a polytomy in Fig. 1). 

The recent results raise a profound question about our understanding of 
paleognath evolution: how strongly corroborated is the new paradigm? It is important
to recognize that this new view of palaeognath history actually has three major
components: (1) ratites are not monophyletic; (2) ratites had a volant and vagile 
ancestor that lost flight independently after dispersing; and (3) the morphological 
features that unite ratites reflect convergence. The first of these is strongly 
corroborated, although the phylogenomic study of Le Duc et al. (2015), which
included an analysis of 623 coding regions, supported ratite monophyly. However,
the Le Duc et al. (2015) results are unlikely to be accurate because their taxon
sampling was poor (the only paleognaths sampled were ostrich, a kiwi, and a
tinamou) and because they analyzed coding data, which are more likely to be misleading than
noncoding sequences (see below, in Sect. 3). It is possible to argue that the second
and third components of the current hypothesis are more equivocal. They are also the 
more interesting components of the current hypothesis. A tree topology that nests the 
volant tinamous within the flightless ratites does not provide definitive evidence for 
multiple losses of flight; logically, it could reflect a single loss of flight followed by a 
reversal to a volant state in tinamous. In fact, a skeptic might point out that a single 
loss of flight followed by the reacquisition of flight in tinamous is actually the most 
parsimonious optimization of the volant/flightless character state. Several authors 
(e.g., Harshman et al. 2008; Smith et al. 2013) argued against that position by 
pointing out that there are many examples of evolution from a volant to flightless 
state (e.g., Wright et al. 2016) whereas evidence for evolution in the other direction 
is absent. That argument provides evidence for multiple losses of flight within 
paleognaths when the data are interpreted in a likelihood framework. The final 


Fig. 1 (continued) (Phaethontimorphae); (4) cuckoos, bustards, and turacos (Otidimorphae); 
(5) nightjars, swifts, hummingbirds, and allies (Caprimulgiformes); (6) doves, mesites, and sand- 
grouse (Columbimorphae); and (7) flamingos and grebes (Phoenicopterimorphae). We indicate the 
limited support for Otidimorphae using a hash (#). Reddy et al. (2017) called shorebirds, cranes, and 
the Hoatzin the “orphan orders”; shorebirds and cranes form a clade (Cursorimorphae) in the Jarvis 
et al. (2014) TENT but they represent independent lineages in the Prum et al. (2015) tree

component of the current hypothesis (that the other features that appear to unite
ratites arose by convergence) might appear to have been resolved by Johnston (2011), 
who presented a morphological phylogeny that supports ratite non-monophyly and 
places ostrich sister to all other paleognaths (like the molecular studies). However, 
the vast majority of morphological phylogenies support ratite monophyly, including 
recent studies (Bourdon et al. 2009; Worthy and Scofield 2012; and the unconstrained
analyses in Worthy et al. 2017). This suggests a truly remarkable degree of
convergence if the current molecular hypothesis is correct. 
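
The parsimony argument about flight loss discussed above (a single loss followed by regain in tinamous versus several independent losses) can be made concrete with a small weighted-parsimony (Sankoff) calculation. The sketch below is illustrative only: the resolved topology among the non-ostrich paleognaths is an assumption (these relationships remain unresolved), the taxon set is simplified, and the `gain_cost` values are arbitrary choices rather than estimates from data.

```python
INF = float("inf")
STATES = ("volant", "flightless")


def sankoff(node, tip_states, cost):
    """Return, for each state at this node, the minimum cost of the subtree below it."""
    if isinstance(node, str):  # tip: only the observed state is allowed
        return {s: (0 if tip_states[node] == s else INF) for s in STATES}
    totals = {s: 0 for s in STATES}
    for child in node:
        child_costs = sankoff(child, tip_states, cost)
        for s in STATES:
            totals[s] += min(child_costs[t] + cost[(s, t)] for t in STATES)
    return totals


def min_cost(tree, tip_states, gain_cost):
    """Minimum total cost when a loss of flight costs 1 and a regain costs gain_cost."""
    cost = {
        ("volant", "volant"): 0,
        ("flightless", "flightless"): 0,
        ("volant", "flightless"): 1,          # loss of flight
        ("flightless", "volant"): gain_cost,  # reacquisition of flight
    }
    return min(sankoff(tree, tip_states, cost).values())


# Hypothetical topology with tinamous nested among ratites (assumption).
tree = ("neognath", ("ostrich", ("rhea", (("emu", "kiwi"), "tinamou"))))
tip_states = {
    "neognath": "volant", "tinamou": "volant",
    "ostrich": "flightless", "rhea": "flightless",
    "emu": "flightless", "kiwi": "flightless",
}

print(min_cost(tree, tip_states, gain_cost=1))  # 2: one loss plus one regain
print(min_cost(tree, tip_states, gain_cost=5))  # 3: three independent losses
```

With equal costs, one loss plus one regain (two changes) is the cheapest explanation on this toy tree. Once regaining flight is penalized relative to losing it, three independent losses become optimal, which is the intuition behind weighting the two directions of change differently.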

Understanding the developmental basis for the loss of flight in different ratite 
lineages could provide a direct way to examine the multiple loss of flight hypothesis. 
Faux and Field (2017) found that tinamous retain the ancestral pattern of wing length 
development (assuming the chicken character state is ancestral) whereas ostriches 
and emus exhibit different patterns of wing development. However, Faux and Field 
(2017) ultimately map three character states onto a four-taxon tree; thus, all possible 
topologies are equally parsimonious (all trees require two character state changes to 
explain the observed data). It will probably be necessary to understand the molecular 
basis for the loss of flight in each lineage to resolve this issue definitively. Whole- 
genome sequencing along with the identification of functional elements (using 
approaches similar to Seki et al. 2017) is likely to facilitate the necessary evo-devo 
studies. Sackton et al. (2018) used this approach, analyzing 14 paleognath genome 
assemblies and identifying 63 noncoding elements that are likely to be transcriptional
enhancers that also exhibit an unusually high degree of sequence divergence
in ratites. They examined one of these “ratite-accelerated regions” experimentally, 
finding that the chicken or tinamou sequences had enhancer activity in the developing 
chick forelimb, whereas the orthologous rhea sequence did not. Similar experiments 
focused on “ratite-accelerated regions” from other paleognath species should be very 
informative (e.g., Cloutier et al. 2018). Examining the activity of “resurrected” 
ancestors of these regions (i.e., sequences reflecting computational ancestral state 
reconstructions) could be even more informative; these types of experiments could be
conducted by combining standard ancestral state reconstruction methods (e.g., 
Huelsenbeck and Bollback 2001) with approaches from synthetic biology (reviewed 
by Hughes and Ellington 2017). These types of tools are likely to usher in a new era 
for our understanding of these fascinating birds. 
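
The parsimony argument concerning the Faux and Field (2017) character mapping can be checked with a short sketch. The Python code below is an illustration only: it applies the standard Fitch small-parsimony count to the three possible four-taxon topologies, each rooted arbitrarily (for unordered states the count does not depend on the position of the root). The four taxa are inferred from the description above, and the state labels are placeholders assumed for illustration rather than the coding used by Faux and Field (2017).

```python
def fitch(tree, states):
    """Fitch small-parsimony pass on a rooted binary tree.

    `tree` is a leaf name (str) or a 2-tuple of subtrees; `states` maps leaf
    names to unordered character states. Returns the candidate state set at
    the root of `tree` and the minimum number of state changes below it.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical coding of wing-development pattern (labels are placeholders).
wing_pattern = {
    "chicken": "ancestral",
    "tinamou": "ancestral",
    "ostrich": "pattern_B",
    "emu": "pattern_C",
}

# The three possible unrooted four-taxon topologies, rooted arbitrarily.
topologies = {
    "(chicken,tinamou),(ostrich,emu)": (("chicken", "tinamou"), ("ostrich", "emu")),
    "(chicken,ostrich),(tinamou,emu)": (("chicken", "ostrich"), ("tinamou", "emu")),
    "(chicken,emu),(tinamou,ostrich)": (("chicken", "emu"), ("tinamou", "ostrich")),
}

scores = {name: fitch(tree, wing_pattern)[1] for name, tree in topologies.items()}
for name, score in scores.items():
    print(name, score)
```

Under this coding every topology requires the same number of changes, so the character cannot discriminate among the alternative trees.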

Studies focused on the phylogenetic position of extinct paleognaths (moas and 
elephant birds) represent another exciting research area in that they have now moved 
into the phylogenomic era (Baker et al. 2014; Grealy et al. 2017; Yonezawa et al. 
2017). Surprisingly, these investigations have further corroborated earlier studies 
that placed the Neotropical tinamous as sister to the extinct New Zealand moas 
(Haddrath and Baker 2012; Phillips et al. 2010; Smith et al. 2013), on the one hand, 
and the New Zealand kiwis sister to the extinct Malagasy elephant birds, on the other 
(Mitchell et al. 2014). Estimates of divergence times for key taxa in these clades 
(50-60 Mya for the kiwi and elephant bird; Mitchell et al. 2014; Yonezawa et al. 
2017) have also been interpreted as precluding plausible avenues of overland 
dispersal. The very young age for crown paleognaths inferred by Prum et al. 
(2015) implies that those divergence times could be even more recent. Berv and 
Field (2018) suggested the conflicts in divergence times reflect accelerated molecular
evolution early in the Paleogene. They attributed this to the
observation that birds with smaller body size tend to exhibit higher rates of
molecular evolution, combined with the “Lilliput effect,” which is the tendency for
the lineages that survive mass extinctions to exhibit a marked decrease in body size 
(Urbanek 1993). However, estimates of paleognath divergence time are probably too 
recent for overland dispersal even if the Berv and Field (2018) hypothesis of an early 
Paleogene rate acceleration is incorrect; if their hypothesized rate acceleration 
is correct, it would provide additional evidence against overland dispersals. Those 
late divergences among paleognaths have been viewed as additional evidence 
corroborating the multiple loss of flight hypothesis (reviewed by Allentoft and 
Rawlence 2012; for a dissenting argument, see Worthy and Scofield 2012, p. 88). 
Regardless of the details, it seems clear that studies focused on both extant and 
extinct paleognaths will continue to provide many interesting findings in the 
genomic era. 


2.2 Galloanseres (Landfowl and Waterfowl) 

In sharp contrast to Palaeognathae and Neoaves (see below), in which there is 
substantial uncertainty regarding many relationships, the picture for Galloanseres 
is one of greater certainty. Monophyly of Galloanseres was initially controversial 
(e.g., Ericson 1996), but that controversy was resolved prior to the phylogenomic era 
(cf. Cracraft 2001). Relationships among the families were also established without 
phylogenomic data, with Cox et al. (2007) resolving the last major question, the 
positions of New World quail (Odontophoridae) and guineafowl (Numidae), using 
only eight nuclear loci and three mitochondrial regions. Phylogenomic approaches 
have been remarkably successful within galloanserine families; species-rich trees 
with 100% bootstrap support at almost every node are now available, using sequence 
capture data for Galliformes (summarized in Table 2) and more than 6.6 million base 
pairs (Mbp) of coding sequence data for Anseriformes (Ottenburghs et al. 2016a). 
There are certainly a few relationships that remain poorly supported, both in 
Galliformes (Hosner et al. 2015b; Meiklejohn et al. 2016) and Anseriformes (e.g., 
Reddy et al. 2017 was unable to resolve the radiation of tribes within Anatidae with 
confidence). Indeed, the poor resolution of anatid tribes might be viewed as overly 
pessimistic based on Sun et al. (2017), although that study reflected analyses of 
the mitochondrial genome which is ultimately a single gene tree that can differ from 
the species tree (see below in Sect. 3). Regardless, problematic nodes within 
Galloanseres are the exception and not the rule. Likewise, some taxa have not 
been included in phylogenomic trees because they have only been sampled for a 
limited amount of data (or because molecular data remain unavailable). However, 
analyses of sparse supermatrices that combine legacy markers (sequences obtained 
by PCR) with phylogenomic data have proven to be a successful strategy even in 
those cases for which specific taxa are data limited (e.g., Hosner et al. 2016; Persons 
et al. 2016). Overall, it is very likely that a species-level phylogenomic tree of 
Galloanseres will be completed in the near future (at least for the level of species
named in current checklists). 

One area in which our knowledge of the early evolution of Galloanseres is limited 
is the estimation of time-calibrated trees; this is problematic because the oldest 
neornithine fossils are putatively galloanserines. The oldest of these, Austinornis
lentus from the Cretaceous Austin chalk, can probably be dismissed. The fossil placing
Austinornis within Galloanseres (as a stem galliform) is fragmentary, and Clarke
(2004) was only able to score it for 9 of 202 characters; the skeptical position that 
Austinornis is a stem galloanserine (or even some other Cretaceous bird lineage) is 
more appropriate than viewing it as evidence for the existence of crown Galloanseres 
ca. 85 million years ago (Mya). Another putative Cretaceous galloanserine, Vegavis 
iaai, has attracted substantial attention because Clarke et al. (2005) placed it within
Anseriformes with high (99%) bootstrap support. However, more recent analyses 
place Vegavis outside crown Anseriformes (Agnolin et al. 2017; Lee et al. 2014; 
O’Connor and Zhou 2013; Worthy et al. 2017). Indeed, Mayr et al. (2018) went 
further and questioned whether Vegavis was even galloanserine. Placing Vegavis 
within crown Anseriformes has a major impact on divergence time estimates for the 
avian tree as a whole (e.g., Prum et al. 2015), so resolving its position with 
confidence is critical. Ancient galloanserine fossils that are reliably placed within 
crown Anseriformes do exist (e.g., Presbyornithidae; De Pietri et al. 2016), but they
are younger than Vegavis. For example, the oldest fossil placed within the
anseriform crown in the maximum parsimony (MP) and Bayesian analyses of
Worthy et al. (2017) was the Eocene Presbyornis pervetus. Kurochkin et al.
(2002) did place the Cretaceous Teviornis gobiensis in Presbyornithidae, but the
presbyornithid affinities of Teviornis are questionable (Clarke and Norell 2004).
Indeed, all of the putative upper Cretaceous fossils assigned to extant avian orders, 
including a number of galloanserines (see Hope 2002), are quite fragmentary. 
Fountaine et al. (2005) suggested the fragmentary nature of the Cretaceous 
neornithine fossil record reflects a biological signal given the collecting efforts; if
so, it strongly suggests that the ancient calibrations that have been used for 
Galloanseres in many clock studies are inappropriate. Even within the Paleogene, 
a number of galloanserine fossils appear to have been placed incorrectly (Ksepka 
2009; Wang et al. 2016). Finding better ways to incorporate fossil evidence is likely 
to be more important for establishing divergence times for Galloanseres (and birds as 
a whole) than the availability of genome-scale sequence data. 


2.3 Neoaves (All Remaining Extant Birds) 

Recent phylogenomics studies support the division of Neoaves into ten major 
lineages, seven of which contain multiple orders. Reddy et al. (2017) called the 
superordinal clades the “magnificent seven” and referred to the remaining three 
lineages as the “orphan orders.” Interrelationships among these major lineages 
remain poorly resolved [Fig. 1; also see Thomas (2015) for another comparison of 
the Jarvis et al. (2014) and Prum et al. (2015) trees]. Moreover, differences across 
recent studies are sensitive to the data type used for analyses (i.e., the estimates of
phylogeny differ depending on whether exons, introns, noncoding ultraconserved 
elements, conserved non-exonic elements, or TE insertions were used for analyses; 
Jarvis et al. 2014; Reddy et al. 2017). The observed conflict among analyses is so 
striking that Suh (2016) suggested that the relationships among these taxa might 
reflect a hard polytomy. We are skeptical of this extreme hypothesis; there has been 
remarkable progress toward a satisfying resolution of Neoaves in the early part of the 
phylogenomic era (i.e., the convincing evidence for the magnificent seven). We 
believe this progress represents a good reason to expect that continued data collection
and method development will ultimately resolve relationships among these
major groups. However, it is clear that branching did occur over a very short interval
of time early in the Paleocene, and this has made Neoaves one of the most difficult
problems in modern phylogenetics. Below we summarize the ongoing progress
made toward resolving the neoavian tree and highlight some of the remaining 
open questions (although we emphasize that our discussion is not exhaustive). 

The Base of the Neoavian Tree Identifying the basal split within the Neoaves has 
long been a vexing problem, but one that is seemingly approaching resolution. Initial 
large-scale analyses (Fain and Houde 2004; Ericson et al. 2006; Hackett et al. 2008) 
identified a large but relatively poorly supported cluster called Metaves (we indicate 
the “metavian” taxa in Fig. 2 with an asterisk). Metaves comprised Caprimulgiformes
(as currently defined to include nightjars, swifts, hummingbirds, and allies) along
with doves, mesites, sandgrouse, flamingos, grebes, the Kagu and Sunbittern
(Eurypygiformes), tropicbirds, and (in some analyses) the Hoatzin. Those studies 
placed Metaves sister to all other neoavian taxa, the latter being designated 
Coronaves by Fain and Houde (2004). Kimball et al. (2013) showed that the signal 
supporting Metaves was almost exclusively associated with a single locus 
(FGB/β-fibrinogen). Phylogenomics is based on the idea that analyses of many
loci can overcome the existence of misleading signal in any individual locus. The
preponderance of the phylogenomic data of Jarvis et al. (2014) clarified the phylogeny
of Metaves, identifying a cluster that includes flamingos and grebes along with
doves, sandgrouse, and mesites (collectively named Columbea by Jarvis et al. 2014) 
and placing that clade sister to all other Neoaves (termed Passerea; Jarvis et al. 
2014). In contrast, Prum et al. (2015) places Caprimulgiformes sister to all other 
Neoaves (Fig. 2). Both Jarvis et al. (2014) and Prum et al. (2015) dispersed the other 
metavian taxa across several other basal neoavian lineages. 

Reddy et al. (2017) analyzed the topological conflicts between Jarvis et al. (2014) 
and Prum et al. (2015) and concluded that the observed conflicts reflect “data-type 
effects,” which they defined as the observation that there are “different signals 
associated with analyses of subsets of the genome that can be defined a priori 
using non-phylogenetic criteria.” The data used to generate the Prum et al. (2015) 
tree were 82.5% exonic, and this very probably led to incorrect taxonomic groupings 
on their tree, whereas the Jarvis et al. (2014) TENT reflects an analysis of 41.8 Mbp 
that included a mixture of introns, coding exons (first and second codon positions 
only), and noncoding UCEs. Reddy et al. (2017) argued that exons are more likely to 
[Fig. 2, tree panels: (a) Jarvis et al. TENT; (b) Prum et al. anchored hybrid enrichment tree. Tips are the neoavian orders; asterisks mark taxa placed in “Metaves.”]

Fig. 2 The Jarvis et al. (2014) TENT and the Prum et al. (2015) anchored hybrid enrichment 
(sequence capture) tree exhibit many differences at the base of Neoaves. Both of these trees are 
presented as rooted trees for Neoaves and taxa placed in “Metaves” (see text) are indicated with 
asterisks. (a) Jarvis et al. (2014) TENT with low-support branches (branches with <100% bootstrap
support) indicated using thin lines. Very low-support branches (branches with <70% bootstrap
support) are indicated with a hash (#) below the relevant branch. The strongly supported basal
division of Neoaves into Columbea and Passerea is indicated. (b) Prum et al. (2015) tree;
low-support branches (<70% bootstrap support) and very low-support branches (<50%
bootstrap support) are indicated in the same manner. Taxa placed in Columbea and Passerea are also
indicated to the side of the Prum et al. (2015) tree to emphasize their non-monophyly in that analysis; an
“expanded waterbird clade” (named Aequorlitornithes by Prum et al. 2015) is indicated using a gray
box. We used different bootstrap support cutoffs for the two trees because they are based on data 
matrices of different sizes (more than 13,000 loci for the Jarvis et al. 2014 TENT but only 259 loci 
for the Prum et al. 2015 tree) 


be misleading than noncoding regions for two reasons: (1) coding exons exhibited 
greater GC-content variation than noncoding regions and (2) the structure of the 
genetic code combined with selection to maintain the amino acid sequence 
represents a violation of most models used for phylogenetic analyses. However, 
the fundamental observations underlying Reddy et al. (2017) were empirical: 
(1) analyses of a largely noncoding 54-locus data matrix for 235 species supported 
the same basal split in Neoaves as the Jarvis et al. (2014) TENT; and (2) trees based 
on large-scale coding datasets exhibited more topological similarities to each other 
than to trees based on noncoding data. Reddy et al. (2017) also found that trees based 
on rare genomic changes, like TE insertions, were more congruent with the trees 
based on noncoding data than with the trees based on coding data (see Fig. 6 in 
Reddy et al. 2017). Taken as a whole, those results indicate that the observed 
differences between Jarvis et al. (2014) and Prum et al. (2015) are more likely to 
reflect data type than taxon sampling. We address below some of these remaining 
conflicts revealed by these three studies as they represent some of the most important
puzzles in avian phylogenetics. 
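
The first of the two reasons given by Reddy et al. (2017), greater variation in GC content among taxa for coding exons, can be made concrete with a minimal sketch. The Python code below is not the procedure used in that study: it simply computes the GC fraction of each taxon in an alignment and reports the spread across taxa. The function names and the two toy alignments are invented for illustration and do not represent real avian sequences.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def gc_range(alignment):
    """Largest minus smallest per-taxon GC fraction in an alignment.

    `alignment` maps taxon names to aligned sequences; gaps and ambiguity
    codes are ignored when GC content is computed.
    """
    values = [gc_fraction(seq) for seq in alignment.values()]
    return max(values) - min(values)


# Invented toy alignments (assumptions for illustration only).
toy_exon = {
    "taxon_A": "GCGCGCATGC",
    "taxon_B": "ATATGCATAT",
    "taxon_C": "GCGCATGCGC",
}
toy_intron = {
    "taxon_A": "ATGCATGCAT",
    "taxon_B": "ATGCATGCTT",
    "taxon_C": "ATGCGTGCAT",
}

print("exon GC range:", round(gc_range(toy_exon), 3))
print("intron GC range:", round(gc_range(toy_intron), 3))
```

A larger spread indicates stronger compositional heterogeneity among taxa, which violates the stationarity assumption of most commonly used substitution models.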

Phoenicopterimorphae (Clade 7; Also Called Mirandornithes)
and Columbimorphae (Clade 6) There is little question that a close relationship 
between flamingos and grebes (clade 7 in Fig. 2) was astonishing to most 
ornithologists when it was first proposed (Van Tuinen et al. 2001). However, that 
node was relatively easy to resolve (even with mitochondrial DNA; Van Tuinen 
et al. 2001; also see Cracraft et al. 2004, p. 476) and was present in the results of 
analyses conducted by many different investigators in multiple labs (e.g., Chubb 
2004; Fain and Houde 2004; Mayr 2004a; Ericson et al. 2006; Hackett et al. 2008). 
Although few believed that result at the time, it is now solidly accepted. The closest
relative of the flamingo-grebe clade is another matter. Hackett et al. (2008) placed 
them sister to a subset of metavian taxa, in a clade comprising doves, mesites, and 
sandgrouse (clade 6 in Fig. 2) and tropicbirds. Jarvis et al. (2014) placed flamingos 
and grebes sister to doves, mesites and sandgrouse, separating the tropicbirds from 
both lineages. Importantly, the clade comprising doves, mesites, sandgrouse, 
flamingos, and grebes (Columbea) was strongly (100% bootstrap) supported and 
sister to all other Neoaves (Passerea). Reddy et al. (2017) also recovered the deep 
division between Columbea and Passerea, albeit with lower bootstrap support 
(>95% for Columbea but only 50-90% for Passerea). In sharp contrast, Prum
et al. (2015) did not recover Columbea or Passerea. Instead, Prum et al. (2015)
placed flamingos and grebes sister to shorebirds in a “generalized waterbird clade”
(emphasized using a gray box in Fig. 2). Provocatively, many analyses of coding
exons in Jarvis et al. (2014) also supported a generalized waterbird clade, albeit with
rearrangements relative to Prum et al. (2015). Those results are consistent with the
data-type effects hypothesis advanced by Reddy et al. (2017). Prum et al. (2015) also
supported monophyly of doves, mesites, and sandgrouse but placed that clade sister 
to cuckoos, bustards, and turacos; they named this large clade, which comprises 
clades 4 and 6 from Fig. 2, Columbaves. We consider the latter relationship to be less 
likely given the nature of character support discussed by Reddy et al. (2017), but it 
seems clear that the Columbea and Columbaves hypotheses both deserve additional 
study. 

Caprimulgiformes (Clade 5; Also Called Strisores) Many current taxonomies
treat Caprimulgiformes (the clade comprising nightjars, nighthawks, oilbirds, 
potoos, frogmouths, owlet-nightjars, swifts, and hummingbirds) as a single order 
(Table 3), but those diverse taxa were split into at least two orders in older 
classifications (e.g., hummingbirds and swifts in Apodiformes; Table 3). For this 
reason, Reddy et al. (2017) viewed Caprimulgiformes as one of their “magnificent 
seven” superordinal clades. As described above, many analyses of coding exons 
(including Prum et al. 2015) place Caprimulgiformes sister to all other Neoaves.
There is also uncertainty regarding the relationships among the families within 
Caprimulgiformes. This could reflect a data-type effect, since Reddy et al. (2017) 
place the potoo-oilbird clade sister to all other caprimulgiforms (Fig. 3a), whereas 
Fig. 3 The position of the root of Caprimulgiformes (clade 5) is uncertain. (a) Most analyses
reported by Reddy et al. (2017) place the root between the oilbird + potoos clade and all other 
Caprimulgiformes, although some analyses (see supporting material for Reddy et al. 2017) place the 
root between potoos and all other Caprimulgiformes. (b) Prum et al. (2015) place the root between 
the nightjar family and all other Caprimulgiformes. (c) The unrooted ingroup topology for 
Caprimulgiformes is identical in the Prum et al. (2015) and Reddy et al. (2017) analyses, but the 
position of the root differs. Support is indicated as above, with thin branches or thin branches and a 
hash (#). We do not consider Jarvis et al. (2014) in this figure because that study only included three 
Caprimulgiformes (a nightjar, a swift, and a hummingbird); relationships among those taxa are 
strongly supported in many analyses (including analyses of individual genes; cf. fig. 1b in Hackett
et al. 2008) 


Prum et al. (2015) place nightjars and nighthawks (Caprimulgidae) sister to all 
other caprimulgiforms (Fig. 3b). Both studies support a clade comprising the 
five remaining families (frogmouths, owlet-nightjars, swifts, tree-swifts, and 
hummingbirds). The uncertainty actually reflects alternative placements of the root 
since both studies supported the same unrooted ingroup topology (Fig. 3c). Obviously,
this group is an excellent target for additional phylogenomic analyses. In fact,
phylogenomic analyses of caprimulgiforms could be especially informative given
their extensive Paleogene fossil record (Mayr 2004b, 2009). Those fossils have long
suggested that nightjars and their relatives arose early in the neoavian radiation so 
they can provide an excellent source of calibrations for molecular clock studies. 
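
The distinction drawn above between the position of the root and the unrooted ingroup topology can be illustrated with a small sketch. The Python code below is a simplification under stated assumptions: trees are written as nested tuples, and the five remaining families (frogmouths, owlet-nightjars, swifts, tree-swifts, and hummingbirds) are collapsed into a single tip with the hypothetical label "five_family_clade" so that no internal resolution is invented. The two trees follow only the verbal descriptions of the Reddy et al. (2017) and Prum et al. (2015) results given in the text.

```python
def leaves(tree):
    """Set of tip names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def rooted_clades(tree):
    """Tip sets of all internal nodes; these depend on where the root is."""
    clades = set()

    def visit(node):
        if isinstance(node, str):
            return
        clades.add(leaves(node))
        for child in node:
            visit(child)

    visit(tree)
    return clades


def unrooted_splits(tree):
    """Nontrivial bipartitions of the tips; these ignore the root position."""
    all_taxa = leaves(tree)
    splits = set()
    for side in rooted_clades(tree):
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
    return splits


# Simplified trees based on the descriptions in the text (assumptions).
reddy_like = (("potoos", "oilbird"), ("nightjars", "five_family_clade"))
prum_like = ("nightjars", (("potoos", "oilbird"), "five_family_clade"))

print("same unrooted topology:", unrooted_splits(reddy_like) == unrooted_splits(prum_like))
print("same rooted clades:", rooted_clades(reddy_like) == rooted_clades(prum_like))
```

The two trees share every bipartition but differ in their rooted clades, which is the pattern summarized in Fig. 3c.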

Otidimorphae (Clade 4) A clade comprising cuckoos, bustards, and turacos has 
emerged in only the most recent phylogenomic studies (Jarvis et al. 2014; Prum 
et al. 2015), although we do not view this group as decisively established (Fig. 1). 
Clade 4 has strong (100%) bootstrap support in the Jarvis et al. (2014) TENT, which 
placed cuckoos sister to a turaco-bustard clade. However, the position of turacos 
sister to bustards is the most poorly supported clade in the Jarvis et al. (2014) TENT 
(only 55% bootstrap support). Prum et al. (2015) does not place turacos sister to 
bustards; instead, the Prum et al. (2015) tree places turacos sister to a cuckoo-bustard 
clade. Clade 4 does have strong support in the Bayesian analysis reported by Prum 
et al. (2015), but it has very low support (only 41%) in their ML tree. The cuckoo- 
bustard clade was strongly supported (98% bootstrap) in the Prum et al. (2015) ML 
analysis. Reddy et al. (2017) did not recover clade 4. Instead, Reddy et al. (2017) 
recovered a cuckoo-bustard clade (albeit with limited support in many analyses) 
and placed turacos elsewhere, sister to cranes, rails, and allies (Gruiformes). 
Provocatively, Reddy et al. (2017) also found the turaco-gruiform clade in their 
reanalyses of the Prum et al. (2015) dataset after excluding data that was also present 
in the Jarvis et al. (2014) TENT dataset (this was done to produce a coding exon 
dataset that was truly independent of Jarvis et al. 2014). Clearly, relationships among 
these taxa, their relationships to other clades, or whether they even form a clade is 
still an open question that is ripe for additional phylogenomic exploration. 

Shorebirds, Cranes, Rails, and the Hoatzin (the “Orphan Orders” in Reddy 
et al. 2017) The three orders containing shorebirds (Charadriiformes); cranes, rails, 
and allies (Gruiformes); and the Hoatzin (Opisthocomus hoazin , the only extant 
species in the order Opisthocomiformes) form a clade on the Jarvis et al. (2014) 
TENT, although this clade did not receive 100% bootstrap support in the TENT. In 
contrast to the TENT, the larger Jarvis et al. (2014) “whole-genome tree” placed 
shorebirds sister to core landbirds (core landbirds are clade 1 in Fig. 2) and placed 
caprimulgiforms as sister to gruiforms, with the Hoatzin as their sister. The whole- 
genome tree reflected an analysis of 322 Mbp, more than seven times larger than the 
TENT dataset. However, Jarvis et al. (2014) expressed concern that the sequence 
alignment and the assessment of orthology for the whole-genome tree dataset were 
inferior to the TENT dataset. Prum et al. (2015) placed the Hoatzin sister to the core 
landbirds (clade 1), naming that more inclusive clade Inopinaves. Prum et al. (2015) 
also separated shorebirds and gruiforms, placing the former in a clade with flamingos 
and grebes and the latter sister to a larger neoavian clade (Fig. 2). Reddy et al. (2017) 
also separated all three of these orders, albeit with low support. Additional 
phylogenomic analyses are clearly necessary to resolve these relationships. 

Other than relationships among palaeognathous birds (see above), perhaps the 
most long-standing problem in avian higher-level relationships has been that of the 
Hoatzin. The Hoatzin exhibits peculiar specializations, including a folivorous diet
and foregut fermentation as well as the hypertrophy and use of forelimb claws by 
nestlings. These features fueled fanciful speculations that the Hoatzin might be 
primitive among extant birds (Feduccia 1996; Olson 1985). Historically, many 
authorities placed Hoatzin close to fowl (Galliformes) or within a questionably 
monophyletic circumscription of Cuculiformes (in that case defined as comprising 
both cuckoos and turacos). Studies that supported the latter position placed the 
Hoatzin sister to either cuckoos or turacos (see Sibley and Ahlquist 1990 for review; 
also see Hughes and Baker 1999; Sorenson et al. 2003). This uncertainty continues 
today; as stated above, no phylogenomic study upholds any of the previous 
hypotheses. Moreover, none of the comprehensive phylogenomic studies (Jarvis 
et al. 2014; Prum et al. 2015; Reddy et al. 2017) agree with one another regarding the
position of the Hoatzin. The McCormack et al. (2013) UCE study, which did not
include all major avian clades, placed Hoatzin sister to shorebirds, similar to the
Jarvis et al. (2014) TENT (Fig. 2a). However, it placed gruiforms (represented in
that study by the trumpeter, family Psophiidae) in another position. The difficulty 
of resolving the position of the Hoatzin likely reflects the rapid radiation of 
Neoaves itself and the fact that the Hoatzin lineage diverged very close in time to 
all lineages that are putative close relatives. Coupled with that is the monotypy of 
Opisthocomiformes, which eliminates the possibility of breaking up its long-branch 
stem. The fossil record unfortunately sheds little light on the issue except to 
document that early hoatzins of modern appearance (as far as it is known from 
limb bones) were distributed in South America, Europe, and Africa during the Oligo- 
Miocene (Mayr 2014; Mayr et al. 2011; Mayr and De Pietri 2014). 

Phaethontimorphae (Clade 3), Aequornithia (Core Waterbirds; Clade 2), 
and Telluraves (Core Landbirds; Clade 1) These clades comprise many phenotypically
distinct lineages. Clade 3 comprises tropicbirds and Eurypygiformes (the
Sunbittern and Kagu), lineages placed in Metaves by early phylogenomic studies 
(Fig. 2). Later studies resolved them as a distinct lineage that is either related to core 
waterbirds (clade 2; Fig. 4a) or core landbirds (clade 1; Fig. 4b). This could also be a 
data-type effect; analyses that include exonic data (including the Jarvis et al. 2014 
TENT) place the tropicbird-eurypygiform clade as sister to waterbirds, whereas 
analyses of exclusively or almost exclusively noncoding data matrices [the analyses 
of introns and noncoding UCEs in Jarvis et al. (2014) and Reddy et al. (2017)] place 
the tropicbird-eurypygiform clade sister to landbirds. However, if this is a data-type 
effect, it is unusual. Reddy et al. (2017) proposed the terminology data-type effects 
in order to discuss the conflicts between the Jarvis et al. (2014) TENT and the Prum
et al. (2015) tree with respect to the deepest divergence in Neoaves. In this case, the
Jarvis et al. (2014) TENT (which reflects a data matrix that is 68% noncoding)
supports monophyly of Columbea (clades 6 and 7) and places Columbea sister to
Passerea (all other Neoaves). The Jarvis et al. (2014) TENT and intron trees as well
as the Reddy et al. (2017) tree support reciprocal monophyly of Columbea and
Passerea; the Jarvis et al. (2014) UCE tree supports monophyly of Passerea. Thus, the
TENT resembles trees based on noncoding data. In contrast, the position of the 
[Fig. 4, panel (c): Reddy et al. (2017) waterbird topology; tips are the core waterbird families and orders]


Fig. 4 Relationships among core landbirds (clade 1), core waterbirds (clade 2), and the 
tropicbird + eurypygiform clade (clade 3). (a) Phylogeny based on TENT, WGT, and coding
data. (b) Phylogeny based on introns, UCEs, and noncoding data. The position of the
tropicbird + eurypygiform clade sister to either core waterbirds or core landbirds depends on the
data type analyzed (analyses of coding data and mixtures of coding and noncoding data support a
waterbird sister hypothesis, whereas analyses of noncoding data alone support a landbird sister
hypothesis). (c) The inset shows the topology within core waterbirds. Most large-scale relationships
within core waterbirds are strongly supported and robust both to data type and to analytical 
approach, but there are two exceptions (indicated using Greek letters, see text for details) 


tropicbird-eurypygiform clade in the TENT corresponds to its position in analyses of 
coding data. Another potential explanation for the difficulty placing the tropicbird- 
eurypygiform clade is that they are long branches. This does not reflect length due to 
an accelerated rate of molecular evolution; instead it reflects the fact that neither 
lineage has close relatives. There are only three tropicbird species, all of which are 
very closely related and are placed in a single genus ( Phaethon ). There is another 
extant eurypygiform (the New Caledonian Kagu; Fig. 4), which is placed in a 
different family. Reddy et al. (2017) did break up the long branch to the Sunbittern 
as much as possible by including Kagu, and it remains possible that adding genome- 
scale Kagu data will be helpful; however, adding Kagu represents the limit for the 
addition of taxa to clade 3 in a meaningful way. Thus, the practice of adding taxa,
which many systematists believe to be one of the best ways to improve estimates
of phylogeny (Heath et al. 2008), is unlikely to be a way to improve placement of
clade 3. Overall, because tropicbirds and eurypygiforms are both long branches that
can only be subdivided near the tips, improved phylogenomic analyses may represent
the only hope for a convincing resolution of the position of the
tropicbird-eurypygiform clade.

The relationships among core landbirds, core waterbirds, and the tropicbird- 
eurypygiform clade are somewhat more complex than we indicate in Fig. 4. The 
TENT and intron tree in Jarvis et al. (2014) both support a clade comprising core 
landbirds, core waterbirds, and tropicbirds + eurypygiforms, but most other analyses
place other lineages within that clade. Analyses using exonic data typically nest the
core waterbird + (tropicbird + eurypygiform) clade within a larger clade comprising
many aquatic and semiaquatic lineages (e.g., shorebirds, flamingos, and grebes);
analyses using noncoding data separate those lineages. We believe this “greater
waterbird clade” (named Aequorlitornithes by Prum et al. 2015) is unlikely to be a
true clade because analyses of rare genomic changes, which are likely to have
different strengths and weaknesses relative to analyses of either coding or
noncoding sequences (see below, in Sect. 3), yield a tree topology closer to the
noncoding trees (see Reddy et al. 2017).

The core waterbirds [clade 2, called Aequornithes by Mayr (2011) and
Aequornithia by Cracraft (2013)] are perhaps one of the most exceptional groups
in Neoaves: almost all major groups within this clade have the same topology in all
recent phylogenomic analyses. There are, however, two nodes of interest (branches
α and β in Fig. 4). Branch α, which unites pelicans with the Hamerkop, is especially
surprising; the RAxML (Stamatakis 2014) analyses in both Prum et al. (2015) and 
Reddy et al. (2017) support the resolution shown in Fig. 4 whereas the Bayesian 
analyses conducted in both of those studies support a different resolution (Hamerkop 
sister to pelicans + shoebill). The Bayesian analyses in Prum et al. (2015) reflect the 
use of ExaBayes (Aberer et al. 2014) whereas Reddy et al. (2017) used both 
ExaBayes and MrBayes (Ronquist et al. 2012); thus, these results are not associated 
with specific software. It is unclear whether this topological difference reflects the 
details of the specific programs used for analyses or more fundamental differences 
between ML analyses (where the parameters used for analyses are assigned the 
values that result in the optimal likelihood score) and Bayesian analyses (where the 
method integrates over the uncertainty in those parameters, assuming some prior 
distribution for the parameter values). Regardless, the observed differences between 
ML and Bayesian analyses as implemented in commonly used phylogenetic 
programs deserve further scrutiny. The other branch (β in Fig. 4) is somewhat
more variable across analyses. Both of these remaining questions regarding core 
waterbird phylogeny deserve attention in coming analyses of phylogenomic data. 
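
The contrast between the two inferential frameworks can be illustrated with a deliberately simple sketch that is not drawn from any of the cited studies. It assumes two aligned sequences evolving under the Jukes-Cantor (JC69) model and treats the distance d between them as the only parameter; the function names, the exponential prior, and the grid integration are illustrative assumptions. The ML estimate is the single value of d that maximizes the likelihood, whereas the Bayesian summary averages over all values of d weighted by likelihood and prior.

```python
import math


def jc69_p_diff(d):
    """Probability that a site differs between two sequences separated by
    d expected substitutions per site under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of d given n_diff differing sites out of n_sites
    (constant terms omitted)."""
    p = jc69_p_diff(d)
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)


def ml_distance(n_sites, n_diff):
    """Closed-form ML estimate of d; valid when n_diff / n_sites < 0.75."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * n_diff / n_sites)


def posterior_mean_distance(n_sites, n_diff, prior_mean=0.1,
                            d_max=5.0, steps=20000):
    """Posterior mean of d under an exponential prior (an assumption made
    for this sketch), approximated on a midpoint grid over (0, d_max)."""
    h = d_max / steps
    grid = [(i + 0.5) * h for i in range(steps)]
    # Unnormalized log-posterior = log-likelihood + log-prior.
    log_post = [log_likelihood(d, n_sites, n_diff) - d / prior_mean
                for d in grid]
    shift = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(lp - shift) for lp in log_post]
    return sum(d * w for d, w in zip(grid, weights)) / sum(weights)


if __name__ == "__main__":
    print("ML estimate:       ", ml_distance(100, 10))
    print("Posterior mean:    ", posterior_mean_distance(100, 10))
    print("Strong-prior mean: ", posterior_mean_distance(100, 10, prior_mean=0.01))
```

With few sites the two summaries can differ, and the difference depends on the prior. Phylogenetic programs face the same contrast across many parameters and tree topologies at once, which is far beyond what this one-parameter sketch captures.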

Monophyly of core landbirds (clade 1, also called Telluraves; Yuri et al. 2013) 
is strongly supported in almost all multigene analyses with sufficient taxon sampling
that have been published since Hackett et al. (2008). Like the core waterbirds,
core landbirds comprise some of the most widespread and familiar avian 
lineages, including raptors (hawks, eagles, falcons, and owls), songbirds and allies 
(Passeriformes, typically called passerines), parrots, and the woodpeckers and their 
allies (such as rollers, kingfishers, bee-eaters, hornbills, hoopoes and woodhoopoes, 
trogons, and other lineages). The Jarvis et al. (2014) TENT, like a number of prior 
analyses (e.g., Ericson et al. 2006; Hackett et al. 2008; Kimball et al. 2013), splits 
core landbirds into two clades that Ericson (2012) named Australaves and Afroaves 
(Table 3). That split renders the raptorial lineages para- or polyphyletic. Specifically, 
the falcons are placed in Australaves sister to parrots and passerines whereas the 
hawks, eagles, New World vultures, and owls are placed in Afroaves as the successive
sister groups of a clade comprising mousebirds and a diverse assemblage that
includes woodpeckers and their allies (Fig. 2a). However, the Jarvis et al. (2014)
TENT conflicts with the Prum et al. (2015) topology with respect to the position of 
the clade comprising hawks, eagles, and New World vultures (Accipitriformes); the 
TENT (and Reddy et al. 2017) places accipitriforms sister to all other Afroaves 
whereas Prum et al. (2015) places them sister to all other core landbirds (Fig. 2b). 
The Jarvis et al. (2014) analysis of first and second codon positions supported a third 
topology, with a clade comprising accipitriforms and owls sister to all other core 
landbirds. Although the Jarvis et al. (2014) exon analysis and the Prum et al. (2015)
tree conflict in some respects, both place accipitriforms at the base of core landbirds,
which suggests the position of accipitriforms could reflect another data-type effect.

Although differences between the results of analyses using coding vs. noncoding 
data have emerged as a major source of conflict in the avian tree of life, differences 
due to data-type effects are not sufficient to explain all of the conflicts at the base of 
core landbirds; analytical methods also play an important role (Fig. 5). Analyses of 
noncoding data using multispecies coalescent (MSC) methods (“species tree” 
methods; see below in Sect. 3) yield trees that are more congruent with standard 
concatenated analyses of exon data, either placing an accipitriform-owl clade or 
accipitriforms alone sister to all other landbirds (Fig. 5). Thus, the position of 
accipitriforms and owls represents an interesting case of uncertainty related to two 
different factors: (1) data type and (2) whether the analytical method assumes a 
single underlying tree or a mixture of trees (for additional details regarding analytical 
methods, see below in the next section). 

Mousebirds represent an equally striking source of conflict (Fig. 5). Many 
analyses, regardless of data type or analytical approach, place mousebirds sister to 
the diverse clade comprising woodpeckers and their allies (named Cavitaves by Yuri 
et al. 2013; see Fig. 5). The analysis of UCEs in Jarvis et al. (2014) and the analysis
of TE insertions by Suh et al. (2015) are the major exceptions; analyses of both of those
data types support monophyly of Afroaves, but they place mousebirds sister to the
other taxa in that clade. However, earlier multigene analyses recognized mousebirds
as a rogue taxon, shifting to various positions within core landbirds depending on the
analytical approach, taxon sample, and data (Suh et al. 2011; Wang et al. 2012;
McCormack et al. 2013). Those analyses include positions sister to or even within
Australaves or sister to all other landbirds. Like tropicbirds and eurypygiforms,
mousebirds represent a long branch that can only be subdivided close to the tip. 
This led Suh (2016) to favor the Fig. 5d topology by arguing that the UCE and TE 
analyses are more resistant to the long-branch attraction artifact (see below in Sect. 
3). However, Gilbert et al. (2018) found that the position of mousebirds was unstable 
when they applied data filtering approaches designed to reduce noise to the Jarvis 
et al. (2014) UCE data. The observation that analyses of UCE data are sensitive to 
filtering does not refute the Fig. 5d topology. However, it does question one of the 
examples of congruence advanced by Suh (2016) for favoring that topology (i.e., the 
congruence of the UCE and TE analyses). These conflicts regarding the position of 
mousebirds are especially frustrating given the excellent fossil record of this lineage 
(Ksepka and Clarke 2009; Ksepka et al. 2017). However, when the available 

Fig. 5 Relationships within core landbirds depend on data type and analytical approach. We
present four candidate topologies that have been recovered when different data types and analytical
approaches are used. Multispecies coalescent (“species tree”) analyses are indicated using “MSC”;
all other analyses used concatenated data. (a) Division into two major clades (Australaves and
Afroaves; see Table 3), present in the Jarvis et al. (2014) TENT and the Reddy et al. (2017) analyses
of concatenated noncoding data. (b) Accipitriformes sister to all other core landbirds, found in
the binned MP-EST (Mirarab et al. 2014a) analysis of intronic data reported by Jarvis et al. (2014)
and the Prum et al. (2015) concatenated tree. The Kimball et al. (2013) NJst analysis had a
similar topology (the asterisk indicates a minor rearrangement placing mousebirds sister to owls).
This topology was also supported by unbinned MP-EST analyses of three different noncoding
datasets in Edwards et al. (2017), although the taxon sampling in that study was limited. (c) An
accipitriform + owl clade sister to all other core landbirds, found in binned MP-EST analysis of the
TENT data and unbinned MP-EST analysis of introns by Jarvis et al. (2014). Reddy et al. (2017)
also found this topology in their concatenated analyses of a 104-locus coding data matrix (their
“Prum noJar” tree). (d) Division into two major clades but mousebirds sister to all other Afroaves,
found in the Jarvis et al. (2014) concatenated UCE tree and the Suh et al. (2015) analysis of TE
insertions. Cavitaves is defined as Piciformes, Coraciiformes, Bucerotiformes, Trogoniformes, and
Leptosomiformes (Yuri et al. 2013)


analyses are taken as a whole, it seems the positions of accipitriforms, owls, and
mousebirds within core landbirds should all be approached with caution. Regardless 
of the details, core landbirds appear to represent yet another major avian clade where 
additional phylogenomic analyses will provide surprises and insights. 

3 Why Are the Deep Branches in Neoaves So Difficult?

The motivation for developing phylogenomic methods was the hypothesis that 
analyses of large data matrices would result in a fully resolved tree of life (Gee 
2003). However, as described above, simply collecting more data does not appear to 
be sufficient to resolve the tree of life with confidence. Arguably, the failure of 
phylogenomics to provide a convincing and simple resolution of the tree of life 
should have been expected. Long before it was possible to collect phylogenomic- 
scale data, mathematical studies (e.g., Felsenstein 1978; Hendy and Penny 1989) and 
simulations (e.g., Hillis et al. 1994) had revealed that some analyses of large-scale 
data matrices fail to converge on the correct tree. Much of the early theoretical work 
focused on identifying conditions where specific phylogenetic methods converge 
on an incorrect topology with confidence when additional data are added. This 
suggests that one might expect analyses of genome-scale data to exhibit a high 
degree of support, at least when support is assessed using standard methods (i.e., the 
nonparametric bootstrap; Felsenstein 1985). In sharp contrast to this simplistic 
expectation, many analyses of large data matrices actually result in low support 
and conflicting results. 

In this section we provide a brief review of those foundational results in theoretical
phylogenetics and connect those early results to observed conflicts in the bird tree
(see above). However, we emphasize that we do not view the base of Neoaves (or the 
bird tree in general) as unique; indeed, these types of conflicts have been observed in 
phylogenomic studies focused on other groups (e.g., Philippe et al. 2011; King and 
Rokas 2017; Pease et al. 2018), and we expect many additional challenging nodes to 
be identified across the tree of life during the phylogenomic era. Hinchliff et al. 
(2015) synthesized published phylogenetic information for all organisms, including 
taxa that range from microbes to mammals and birds, and they found 4610 nodes that 
conflict with their taxonomy. Some of those conflicting nodes are likely to reflect 
data-limited studies or cases in which the taxonomy was incorrect, but they also 
highlighted a few cases that reflect conflicts among phylogenomic studies. The 
Hinchliff et al. (2015) study underscores the work that still needs to be done to 
resolve the tree of life, even in the phylogenomic era. However, the base of Neoaves 
is among the best studied phylogenetic problems that remain unresolved. Therefore, 
it seems likely that understanding the reasons for these continuing difficulties in 
resolving the bird tree will have general implications for the resolution of other 
challenging nodes on the tree of life. 

Although many systematists have suggested that complex analytical methods 
may be necessary to arrive at a satisfying resolution of the difficult nodes in the tree 
of life (e.g., Philippe et al. 2011; Reddy et al. 2017; Steel 2005), the biological basis 
of those difficult nodes is actually quite straightforward. The nodes most difficult to 
infer are those associated with short internal branches (int in Fig. 6a). This reflects 
the fact that, ultimately, all phylogenetic methods (including both parametric and 
nonparametric approaches) rely on the existence of characters that unite taxa 
(synapomorphies; cf. Hennig 1966). Given even the simplest models of evolution, 
the probability that a synapomorphic substitution uniting a specific group exists is 

Fig. 6 Potential sources of nonhistorical signals relevant to phylogenomic analyses. (a) Almost all
of the most challenging nodes in the tree of life reflect short internal branches (int) that provide little

directly linked to the length of the internal branch uniting that group (Braun and
Kimball 2001). However, the terminal branch lengths (term in Fig. 6a) also play an 
important role. All other things being equal, longer terminal branches result in a 
higher probability that (1) subsequent substitutions will obscure synapomorphic 
substitutions and (2) convergent substitutions will create the appearance of 
synapomorphies that unite other groups. Thus, correctly inferring the topology for 
short internal branches deep in the tree will be more challenging than inferring short 
branches closer to the tips. In fact, reconstructing the phylogeny for deep branches is 
impossible if the rate of accumulation for substitutions exceeds a specific value 
(Mossel 2003), a particularly pessimistic finding for phylogenomicists. 
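
The interplay between internal and terminal branch lengths can be made concrete with a rough sketch. It assumes a Poisson substitution process, a clade of just two taxa, and branch lengths in expected substitutions per site; it also ignores convergence and the finite number of character states, so it is an upper-level illustration rather than a model used in any of the cited studies. The function names are hypothetical.

```python
import math


def p_clean_synapomorphy(t_int, t_term_a, t_term_b):
    """Per-site probability that at least one substitution occurs on the
    internal branch AND no later substitution occurs on either of the two
    terminal branches below it (so the shared change is not overwritten)."""
    p_change_internal = 1.0 - math.exp(-t_int)
    p_no_later_change = math.exp(-(t_term_a + t_term_b))
    return p_change_internal * p_no_later_change


def expected_clean_synapomorphies(n_sites, t_int, t_term_a, t_term_b):
    """Expected number of such sites in an alignment of n_sites columns."""
    return n_sites * p_clean_synapomorphy(t_int, t_term_a, t_term_b)


if __name__ == "__main__":
    # Same short internal branch, shallow versus deep in the tree.
    for t_term in (0.01, 0.1, 0.5):
        n = expected_clean_synapomorphies(10000, 0.001, t_term, t_term)
        print(f"terminal branches = {t_term}: about {n:.2f} sites per 10 kb")
```

Under these assumptions the expected number of unobscured synapomorphies grows with the internal branch length and shrinks exponentially with the summed terminal branch lengths, which matches the qualitative argument made here.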

The extreme case in which phylogenetic reconstruction is impossible is unlikely
to apply to the data types typically used for avian phylogenomics (i.e., those
used by Jarvis et al. 2014). Chojnowski et al. (2008) used simulations to show that 
analyses of sequences evolving at a rate similar to avian introns could resolve a tree 
with branch lengths similar to those at the base of Neoaves; even as little as 32 kb of 
simulated intron data could yield trees with an average of only one rearrangement. 
Introns are the most rapidly evolving data type analyzed by Jarvis et al. (2014). 
However, examining the results of Jarvis et al. (2014), Prum et al. (2015), and Reddy 
et al. (2017) in light of the earlier Chojnowski et al. (2008) study raises another 
important question: why is there so much incongruence among those studies given 
the large size of the data matrices in each study? The conflict within the Jarvis et al. 
(2014) study is especially troubling. Chojnowski et al. (2008) found that analyses of 
simulated exon data did not perform as well as analyses using simulated intron data. 
However, that study simulated a maximum of 8000 base pairs (bp) of exonic data; 
the Jarvis et al. (2014) exon datasets were three orders of magnitude larger than that. 
Thus, the observed conflicts at the base of Neoaves must reflect much larger issues 
than simply the overall rate of sequence evolution, the time between cladogenic 
events at the base of Neoaves, or even the amount of available data. 

Long-Branch Attraction, Changes in Patterns of Sequence Evolution, and Data
Types. Two major phenomena with the potential to explain the observed conflicts
were identified long before the phylogenomic era using mathematical approaches
and simulations: long-branch attraction and shifts in the model of sequence evolution.
Highly unequal rates of evolution (e.g., Fig. 6b) are thought to be a major
source of long-branch attraction (Felsenstein 1978). This potential source of 
misleading signal is actually one of the reasons that many systematists advocate 


Fig. 6 (continued) problems associated with short internal branches. (b) Example of long-branch
attraction. Branch lengths reflect numbers of substitutions per site. (c) Changes in model parameters,
illustrated for this tree by focusing on shifts in the equilibrium GC-content. (d) Example of a
heterotachous tree mixture. The number of substitutions per site for the locus associated with the
black gene tree differs from the number of substitutions per site for the gray tree. (e) Example of a
tree mixture that has multiple topologies (potentially generated by the MSC). The black and gray
gene trees have the same topology but different branch lengths and the dashed gene tree has a
distinct topology

breaking up long branches by adding taxa (Bergsten 2005), although there are many
reasons that adding taxa is likely to be beneficial. However, highly unequal rates are 
not absolutely necessary for long-branch attraction; Hendy and Penny (1989) found 
cases where the MP criterion is inconsistent (i.e., it converges on an incorrect tree in
expectation) even when the data conform to the molecular clock. Long-branch
attraction has often been viewed as a problem associated with the MP criterion that 
can be solved by parametric methods (e.g., Huelsenbeck 1997; Swofford et al. 2001), 
such as ML and Bayesian inference. However, a number of studies have shown that 
long-branch attraction can mislead those parametric methods when the model used 
for analyses is incorrect (e.g., Gaut and Lewis 1995; Lockhart et al. 1996), a fact 
that should be troubling given that “true underlying models” of evolution are both 
unknown and ultimately unknowable (Sanderson and Kim 2000). Regardless, the 
fundamental finding of this theoretical work is that large amounts of sequence 
data generated by a single model have the potential to converge on an incorrect tree 
with high support. The body of older theoretical work should also give pause to 
systematists who focus only on high support values inasmuch as those support
values could be inflated. However, the results of recent phylogenomic studies do not 
conform to the expectation of relatively simple artifacts like long-branch attraction 
because many challenging nodes on the tree of life receive low support even when 
very large datasets are analyzed. It is now clear that the real-world evolution of 
genomic data cannot be characterized by a single model, making it reasonable to 
speculate the limited support observed in recent phylogenomic studies reflects the 
existence of a complex mixture of evolutionary processes rather than a simple artifact. 
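
The classic unequal-rates form of long-branch attraction can be reproduced with a short calculation. The sketch below assumes a four-taxon tree ((A,B),(C,D)) and a symmetric two-state model, in the spirit of the case analyzed by Felsenstein (1978); the function name, the per-branch change probabilities, and the example values are illustrative assumptions. It computes the expected frequency of each parsimony-informative site pattern, and MP converges on whichever grouping has the highest expected frequency.

```python
from itertools import product


def pattern_probs(p_a, p_b, p_c, p_d, p_int):
    """Expected frequencies of the three parsimony-informative site patterns
    on the true tree ((A,B),(C,D)) under a symmetric two-state model.
    Each argument is the probability that the two ends of that branch
    differ in state; p_int is the internal branch."""
    def trans(parent, child, p):
        return p if parent != child else 1.0 - p

    support = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    for a, b, c, d in product((0, 1), repeat=4):
        prob = 0.0
        # Sum over the unobserved states u, v at the two internal nodes.
        for u, v in product((0, 1), repeat=2):
            prob += (0.5 * trans(u, a, p_a) * trans(u, b, p_b)
                     * trans(u, v, p_int)
                     * trans(v, c, p_c) * trans(v, d, p_d))
        if a == b and c == d and a != c:
            support["AB|CD"] += prob   # pattern favoring the true tree
        elif a == c and b == d and a != b:
            support["AC|BD"] += prob   # pattern uniting the two long branches
        elif a == d and b == c and a != b:
            support["AD|BC"] += prob
    return support


if __name__ == "__main__":
    # A and C are long branches; B, D, and the internal branch are short.
    print(pattern_probs(0.4, 0.05, 0.4, 0.05, 0.05))
    # All branches equal and short.
    print(pattern_probs(0.1, 0.1, 0.1, 0.1, 0.1))
```

In the first example the pattern uniting the two long branches (AC|BD) is expected to be more frequent than the pattern supporting the true tree, so adding sites only strengthens support for the wrong grouping. In the second example the true grouping is favored.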

Patterns of sequence evolution with the potential to mislead phylogenetic 
methods are not limited to long-branch attraction; there are myriad model violations 
that might mislead available analytical methods. The most obvious and easy model 
violation to detect is a case in which the base composition changes across the tree 
(Fig. 6c). Indeed, the results of some phylogenetic analyses appear to be driven 
largely by convergence in base composition (e.g., Katsu et al. 2009; Phillips et al. 
2004), and some authors have suggested that genes with variable base composition 
(often called “nonstationary” base composition) should be excluded from phylogenetic
analyses for this reason (e.g., Collins et al. 2005; Jeffroy et al. 2006). The
general time reversible (GTR) model is the most commonly used model in 
phylogenomics; it assumes that base frequencies remain constant (i.e., stationary) 
over time. However, many other shifts in the patterns of substitution are possible, 
and some do have the potential to affect phylogenetic estimation. For example, 
there is evidence that the rate matrix (the relative rates of various substitution
types, including the transition-transversion ratio and the relative rates of different
transitions and transversions) can change across trees (e.g., Ota and Penny 2003).
Those changes in model parameters can degrade the performance of ML analyses
using standard models like GTR (Casanellas and Fernández-Sánchez 2007). It is
unclear whether changes in base composition or in the rate matrix across the tree will 
necessarily lead to nodes with limited support. It is clear, however, that the bird tree 
exhibits both highly unequal branch lengths (Fig. 7) and variation in model 
parameters (Fig. 7 shows variation in GC-content). This variation may, at least in 
part, explain the limited support for specific avian clades in phylogenomic analyses. 
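
A simple way to screen an alignment for variation in base composition is to compare the GC-content of each taxon at the parsimony-informative sites. The sketch below is one possible way to do so and is not the procedure used to produce the figures in this chapter; the function names and the toy alignment are hypothetical, and gaps or ambiguity codes are simply skipped.

```python
from collections import Counter


def informative_columns(alignment):
    """Indices of parsimony-informative columns: at least two different
    bases are each present in at least two taxa."""
    seqs = list(alignment.values())
    columns = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] in "ACGT")
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            columns.append(i)
    return columns


def gc_content(alignment, columns):
    """GC fraction per taxon, restricted to the given columns."""
    result = {}
    for taxon, seq in alignment.items():
        bases = [seq[i] for i in columns if seq[i] in "ACGT"]
        gc = sum(1 for b in bases if b in "GC")
        result[taxon] = gc / len(bases) if bases else float("nan")
    return result


if __name__ == "__main__":
    toy = {
        "taxon1": "ACGTGCA",
        "taxon2": "ACGTGCA",
        "taxon3": "ATATACA",
        "taxon4": "ATATATA",
    }
    cols = informative_columns(toy)
    print(cols, gc_content(toy, cols))
```

A wide spread of per-taxon values would be a hint (not proof) that a stationary model is a poor description of the data.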

Fig. 7 Phylogram of the Prum et al. (2015) ML tree emphasizing shifts in GC-content and
evolutionary rate. Taxa with the most extreme values for the GC-content of the parsimony 
informative sites and the “fast sites” (sites in the 95th percentile for number of MP steps given 
this tree) are indicated using colored arrows (red for high values and blue for low values). The 
median GC-content for informative sites was 46.8% (range 43.6-50.2%), and the median 
GC-content of the fast sites was 44.7% (range 34.9-54.2%). The files used for these analyses can 
be found in Braun (2018). Some taxa with very high GC-contents were also long branches (long 
branches indicate lineages with elevated substitution rates). We also emphasize lineages with 
branch lengths that differ substantially from their sister groups: (1) rails and allies (which are sister 
to cranes and allies), (2) hemipodes (which are sister to the gulls, skuas, and allies), (3) a kingfisher 
(nested within the order Coraciiformes), and (4) the woodpeckers and allies (sister to jacamars and 
puffbirds). We also emphasize the “birds of prey,” as defined in Jarvis et al. (2014), because they
have the shortest branches (i.e., lowest evolutionary rates) within Telluraves (core landbirds). Many
high-rate taxa are characterized by long branches in analyses of other data (e.g., compare this 
phylogram to Fig. 4 in Reddy et al. 2017). We have indicated taxa in Columbea and Passerea, the 
two major clades within Neoaves in the Jarvis et al. (2014) TENT, on the tree to emphasize that they 
are non-monophyletic in the Prum et al. (2015) tree 

Much of our discussion regarding conflicts within Neoaves (see above) focused
on data-type effects. Data-type effects are not a phylogenetic artifact like long- 
branch attraction or base compositional shifts; they are simply a way to discuss 
different topological signals that emerge in phylogenetic analyses using distinct 
subsets of the genome that can be defined using non-phylogenetic criteria. The 
fundamental idea is that distinct subsets of the genome might exhibit different 
degrees of model violation. If the model used for analysis is not violated by a subset 
of the genome (or, more likely, the model is only violated to a modest degree), then 
analyses of that subset are likely to yield an accurate estimate of phylogeny; if the 
model violation is strong, then analyses could yield an inaccurate estimate of 
phylogeny. Reddy et al. (2017) defined their data types crudely (i.e., coding 
vs. noncoding regions), but one could subdivide the genome more finely. For 
example, it would seem logical to subdivide noncoding data into transcribed and 
non-transcribed regions, whereas coding regions might be subdivided using protein 
structure (in fact, Pandey and Braun 2018 recently reported a data-type effect linked 
to protein structure for the base of Metazoa). Regardless, the fundamental reason that 
Reddy et al. (2017) proposed data-type effects was to provide a framework for 
exploring and discussing variation in the phylogenetic signal evident in different 
parts of the genome; the actual reason(s) why analyses of any particular data type 
might yield an incorrect estimate of phylogeny will relate to specific model 
violations associated with each data type. 

It might seem that one could overcome data-type effects by conducting analyses 
that apply different models to each data partition. Partitioned analyses are common 
practice in phylogenomics (Lanfear et al. 2014), and partitioned analyses could, at 
least in principle, solve data-type effects if one could identify adequate models for 
each partition. However, the idea that Reddy et al. (2017) articulated is that there 
may not be any models that yield accurate estimates of phylogeny for a specific data 
type (or, at the very least, none of the models that are “good enough” have been 
implemented in a software package that is practical to use for phylogenomic 
analyses). Most programs used in phylogenomics, like RAxML and MrBayes, 
only implement the GTR model and its submodels (typically in combination with 
methods to describe among-sites rate heterogeneity like Γ-distributed rates and/or
invariant sites). This has led to the use of the GTR model (or a submodel) in almost 
all empirical studies (Sumner et al. 2012). If the GTR model is fundamentally 
problematic for analyses of one or more of the data type(s), then partitioned analyses 
that apply the GTR model to each partition will also be problematic. The only 
solution would be excluding the problematic data or using models that differ from 
the GTR model in a more fundamental way. Efforts to do the latter are ongoing; for 
example, IQ-TREE (Nguyen et al. 2015) is fast enough for phylogenomic studies 
(Zhou et al. 2018), and it implements a broader suite of models than many other 
programs. If these newer models have a better fit to the data that are poorly described 
by the GTR model and yield accurate estimates of phylogeny for those data, then 
partitioned analyses (using the appropriate models) should ameliorate data-type 
effects. 
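
To make explicit what applying "the GTR model" to a partition assumes, the following sketch builds a GTR instantaneous rate matrix from six exchangeabilities and four base frequencies and checks that those frequencies are stationary and that the process is time reversible. It is a textbook construction written for illustration, not code from any phylogenetic package; the function names and parameter values are hypothetical.

```python
BASES = "ACGT"


def gtr_rate_matrix(exchange, freqs):
    """Build Q with Q[x][y] = r_xy * pi_y for x != y and rows summing to 0.
    exchange: symmetric exchangeabilities keyed 'AC', 'AG', 'AT', 'CG',
    'CT', 'GT'; freqs: equilibrium base frequencies keyed by base."""
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                key = "".join(sorted(x + y))
                q[x][y] = exchange[key] * freqs[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q


def stationarity_residual(q, freqs):
    """Largest |sum_x pi_x * Q[x][y]|; zero means freqs is stationary."""
    return max(abs(sum(freqs[x] * q[x][y] for x in BASES)) for y in BASES)


def reversibility_residual(q, freqs):
    """Largest |pi_x Q[x][y] - pi_y Q[y][x]|; zero means time reversible."""
    return max(abs(freqs[x] * q[x][y] - freqs[y] * q[y][x])
               for x in BASES for y in BASES)


if __name__ == "__main__":
    freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    exchange = {"AC": 1.0, "AG": 4.0, "AT": 0.8,
                "CG": 1.2, "CT": 5.0, "GT": 1.0}
    q = gtr_rate_matrix(exchange, freqs)
    print(stationarity_residual(q, freqs), reversibility_residual(q, freqs))
```

Setting all exchangeabilities equal and all frequencies to 0.25 gives the Jukes-Cantor submodel. Any partitioning scheme built from matrices of this form inherits the same stationarity and reversibility assumptions within each partition.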

Mixtures of Gene Trees That Reflect Heterotachy, Incomplete Lineage Sorting,
or Reticulation. Another challenge for phylogenomic studies is the fact that genomic
data may reflect a mixture of trees rather than a single tree (Maddison 1997). The
simplest case is a mixture of trees with the same topology but different branch 
lengths (Fig. 6d), a phenomenon also called heterotachy (Lopez et al. 2002). 
Heterotachy in protein-coding regions is often thought to reflect shifts in selective 
constraints, causing sites to transition between a state in which substitutions can 
accumulate, and a second state in which substitutions are removed by purifying 
selection (Penny et al. 2001). However, regional variation in mutation rates is well 
characterized in birds (e.g., Axelsson et al. 2005), and shifts in the mutation rate for a 
specific neutral region will also result in heterotachy. Regardless of the biological 
basis for heterotachy, analyses of heterotachous data using standard (i.e., 
non-heterotachous) ML methods can be misleading (Matsen and Steel 2007), sometimes
resulting in a “mixed branch repulsion” analogous to long-branch attraction. A
more extreme tree mixture involves gene trees with different topologies (Fig. 6e). 
Those tree mixtures are biologically realistic; a number of processes, such as 
incomplete lineage sorting (ILS) and hybridization, can result in mixtures of trees 
with different topologies (Maddison 1997). Edwards (2009) pointed out that ILS 
actually results in both heterotachy and discordant trees; the heterotachy reflects 
variation in coalescence times for gene trees with the same topology. In fact, some 
gene trees with the same topology as the species tree still reflect a deep coalescence 
(DC) in which the split in the gene tree occurs prior to multiple speciation events 
(e.g., the gray gene tree in Fig. 6e); branch lengths of those trees will certainly differ 
from the non-DC trees. Mixtures with multiple topologies arise when DC gene trees 
are discordant with the species tree (e.g., the dashed gene tree in Fig. 6e). There is 
strong direct evidence for heterotachy (e.g., note the very different terminal branch 
lengths for BDNF and PPP2CB in Fig. 8), although the impact of heterotachy on 
estimates of avian phylogeny using standard ML methods remains unclear. The 
evidence for discordance among gene trees due to ILS is indirect, but there is a 
strong theoretical basis for expecting ILS whenever there are short branches in a 
species tree. The impact of ILS on estimates of the bird tree obtained using standard 
ML methods also remains uncertain. 
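
The theoretical expectation for ILS can be stated compactly for the simplest case. The sketch below uses the standard three-taxon result from coalescent theory for a species tree ((A,B),C), with the internal branch length T measured in coalescent units and one sampled lineage per species; the function name and dictionary keys are hypothetical. It separates concordant gene trees with and without a deep coalescence, mirroring the distinction drawn above.

```python
import math


def three_taxon_gene_tree_probs(t_coalescent):
    """Gene tree probabilities for species tree ((A,B),C) under the MSC.
    t_coalescent: internal branch length in coalescent units."""
    # Probability that the A and B lineages fail to coalesce within the
    # internal branch, i.e., a deep coalescence occurs.
    p_deep = math.exp(-t_coalescent)
    return {
        "concordant_no_deep_coalescence": 1.0 - p_deep,
        # After a deep coalescence, the three topologies are equally likely.
        "concordant_deep_coalescence": p_deep / 3.0,
        "discordant_AC": p_deep / 3.0,
        "discordant_BC": p_deep / 3.0,
    }


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 3.0):
        probs = three_taxon_gene_tree_probs(t)
        discordant = probs["discordant_AC"] + probs["discordant_BC"]
        print(f"T = {t}: P(discordant gene tree) = {discordant:.3f}")
```

As T approaches zero, two-thirds of gene trees are expected to disagree with the species tree, which is why short internal branches are expected to generate discordance.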

The theoretical and computational phylogenetics community has put a tremendous
amount of effort into the development of MSC (“species tree”) methods over
the past decade (reviewed by Edwards 2009; Edwards et al. 2016; Liu et al. 2009;
Warnow 2018). MSC methods are designed to infer the correct species tree given
mixtures of gene trees due to ILS. Standard ML analyses using concatenated gene 
sequences implicitly assume a single underlying tree with a specific set of branch 
lengths. Thus, standard ML analyses violate the MSC model, and those analyses will 
converge on an incorrect estimate of the true species tree in certain parts of parameter 
space (Kubatko and Degnan 2007; Mendes and Hahn 2017; Roch and Steel 2015). 
Although the fact that MSC methods are consistent (i.e., they converge on the correct 
tree in expectation) is viewed as a desirable property, some authors have raised 
concerns about the criterion of statistical consistency. Warnow (2015) pointed out 
that available proofs of consistency for MSC methods actually focus on a “weak 

Fig. 8 Two examples of gene trees from Reddy et al. (2017), emphasizing differences in relative
rates and base composition. Support values from an ultrafast bootstrap (Minh et al. 2013) analysis in 
IQ-TREE are shown next to branches when they are >70%. (a) Tree based on an intron in the 
PPP2CB locus. This is one of the few individual gene trees that divides Neoaves into Columbea and 
Passerea. As in Fig. 7, the GC-content for informative sites is indicated with colored arrows (red for 
the six most GC-rich taxa and blue for the six least GC-rich taxa). Although the median GC-content 
for informative sites (45.4%) does differ from the GC-content for constant sites (49.7%), this locus 
exhibits limited GC-variation overall (range for informative sites = 41.5-58.4%). (b) Tree based on 
part of the BDNF coding exon. This locus exhibits substantial rate variation (note the very long 
branches for the bee-eater, Darwin’s finch, and tinamou). Like PPP2CB, the median GC-content for 
informative sites (47.5%) differs from the GC-content of constant sites (52.5%). However, this 
region also exhibits substantial GC-content variation (range for informative sites = 33.9-84.8%). 
Most nodes in this gene tree have limited support, similar to the PPP2CB gene tree (and most trees 
based on short gene regions), but the BDNF tree does includes some strongly supported clades. 
However, some of those strongly supported clades contradict monophyly of Palaeognathae and 
Neoaves, which are united by very long branches in all other estimates of the avian species tree. The 
extreme GC-content and rate variation suggest that these conflicts may reflect a biased estimate of 
phylogeny. All data supporting this figure is available in Braun (2018) 





































































184 


E. L. Braun et al. 


B 


Chicken ◄- 
Turkey 


Loon 

- Bustard 

- Grebe 
- Hornbill 


Killdeer 
— Fulmar 
i— Ibis 
\_rj 2 Seriema 
'- Woodpecker 


[7e" E 9 ret 


— 42.4 C 


Cormorant 

Trogon ◄—41 

Pelican 

Adelie penguin 
^|| Emperor penguin 
Falcon 

Cuckoo ◄—42.4% 
— Sun bittern 


— Hoatzin 
White-tailed eagle 
Bald eagle 
Turaco 


BDNF 

coding exon 
length = 688 nt 



33.9% 

Zebra finch < 


Flamingo 
— Tropicbird 

Nightjar 

Hummingbird 
Swift ◄—56.8% 



0.03 

substitutions per site 




PALAEOGNATHAE 


GALLOANSERES 

NEOAVES (Passerea) 
NEOAVES (Columbea) 


] 


— Darwin’s finch ◄— 
Bee-eater ◄—75.4% 


71.2% 

— Tinamou 


◄— 84.8% 


Fig. 8 (continued) 


version” of statistical consistency, showing only that “tree estimated by the species 
tree method will converge in probability to the true species tree as the number of sites 
per locus and the number of loci both increase.” Even this weak version of statistical 
consistency assumes data were generated by gene trees that reflect the MSC model 
along with a model of sequence evolution for which available methods are consistent. 
Springer and Gatesy (2016) pointed out that the MSC model ignores important 
aspects of genome evolution such as selection and linkage. Even as ubiquitous a 
phenomenon as population subdivision will yield a different spectrum of gene trees 
than expected given the simple MSC model (Slatkin and Pollack 2008). It is 
certainly possible that the real-world processes underlying genome evolution are 
close enough to the assumptions that underlie MSC methods that those methods will 
yield an accurate estimate of the species tree; indeed, this is generally assumed to be 
the case when those methods are applied. However, a recent study examined the fit 
of 25 datasets to the MSC model, finding that 20 of those datasets violate the model 
(Reid et al. 2013). Ultimately, it should be clear that the criterion of consistency only 
yields guarantees in the abstract world of mathematics, not in the real world of 
biology. 

Given their growing use in phylogenomics, it is important to understand the types 
of MSC methods available at this time, despite the concerns we expressed regarding 
their justification by appealing to their statistical consistency. Modern MSC methods 
often yield trees that are quite congruent with trees based on standard (concatenated) 
ML analyses (Tonini et al. 2015), and some are actually less computationally burdensome 
than those ML approaches. Xu and Yang (2016) reviewed MSC methods, 
highlighting two basic approaches: (1) full-likelihood methods and (2) gene tree 
summary methods. Xu and Yang (2016) also described (but did not name) a third 
approach that we call site pattern methods. Full-likelihood methods integrate over 
the uncertainty in gene trees (this is the approach Maddison 1997 originally 
suggested). At this time the full-likelihood approach has been implemented in 
BEST (Liu 2008), *BEAST (Heled and Drummond 2010), RevBayes (Höhna 
et al. 2016), and BPP (Rannala and Yang 2017); all of those programs use a 
Bayesian Markov chain Monte Carlo approach, and none are able to scale to 
phylogenomic analyses similar in size to Jarvis et al. (2014). Gene tree summary 
methods involve two steps: (1) a standard method (e.g., ML) is used to generate gene 
trees and (2) the estimated gene trees are combined to generate the species tree. 
Examples of gene tree summary methods include MP-EST (Liu et al. 2010), 
ASTRAL (Mirarab et al. 2014b; Mirarab and Warnow 2015; Zhang et al. 2018), 
and NJst/ASTRID (Liu and Yu 2011; Vachaspati and Warnow 2015). There are 
tools to visualize discordance among estimated gene trees, either as networks or 
using other approaches (e.g., Ottenburghs et al. 2016b; Sayyari et al. 2018). Many 
phylogenomic studies, including a number focused on birds (e.g., Kimball et al. 
2013; McCormack et al. 2013; Jarvis et al. 2014; Edwards et al. 2017), have used 
gene tree summary methods. The practice of estimating gene trees and then combining 
those trees actually imposes less computational burden than standard 
ML analyses of concatenated loci. Finally, site pattern species tree methods use a 
concatenated data matrix as input. However, they differ from standard analyses of 
concatenated data by decomposing the data matrix into quartets (SVDquartets; 
Chifman and Kubatko 2014) or rooted triples (SMRT-ML; DeGiorgio and Degnan 
2010) and identifying the optimal tree for those subsets of taxa. Obviously, the 
method used to infer the quartet (or rooted triplet) subtrees must be consistent given 
the MSC for the approach to be viewed as a species tree method. However, methods 
that are consistent for those subtrees (under at least some circumstances) do exist 
(DeGiorgio and Degnan 2010; Long and Kubatko 2017). Use of singular value 
decomposition to choose the subtrees (the criterion used by SVDquartets) is also 
very fast computationally. After the subtrees are generated, they are combined using a 
supertree method such as MRP (Baum 1992; Ragan 1992) or Quartet MaxCut (Snir 
and Rao 2012) to generate the species tree. Although the use of site pattern methods 
remains less common than the gene tree summary methods, they have begun to 
attract attention in avian phylogenomics (e.g., Hosner et al. 2015b; Meiklejohn et al. 
2016; Moyle et al. 2016; Sun et al. 2014). 
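
As a rough illustration of the second step of a gene tree summary method, the sketch below tallies unrooted four-taxon topologies across a set of estimated gene trees and reports the most frequent one. For four taxa, the most probable unrooted gene tree topology under the MSC matches the species tree, which is the idea that quartet-based summary methods build on. The input format (each gene tree given as the pair of taxa on one side of its internal branch), the function name, and the counts are assumptions made for this example; this is not the algorithm implemented in any of the programs cited above.

```python
from collections import Counter

def summarize_quartets(gene_trees, taxa):
    """Return (most common quartet topology, frequency table).

    Each gene tree is given as the pair of taxa that sit together on one
    side of the internal branch, e.g. ("A", "B") means AB|CD.
    """
    taxa = frozenset(taxa)
    counts = Counter()
    for pair in gene_trees:
        side = frozenset(pair)
        other = taxa - side
        # Store each split in a canonical order so that AB|CD == CD|AB.
        split = tuple(sorted([tuple(sorted(side)), tuple(sorted(other))]))
        counts[split] += 1
    best, _ = counts.most_common(1)[0]
    return best, counts

# Hypothetical estimated gene trees for taxa A-D (made-up counts).
trees = ([("A", "B")] * 50 + [("C", "D")] * 10
         + [("A", "C")] * 22 + [("B", "C")] * 18)
best, counts = summarize_quartets(trees, "ABCD")
print(best, dict(counts))
```

Real summary methods must also deal with gene tree estimation error and with many overlapping quartets, which this toy tally ignores.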

Hybridization represents another major source of discordance among gene trees. 
Hybrids have been documented in most bird orders (Ottenburghs et al. 2015, 2017b), 
and hybridization can impact phylogenetic estimation for recent radiations (e.g., 
Lavretsky et al. 2014). Phylogenomics is likely to revolutionize the study of these 
complex radiations (Lamichhaney et al. 2015; Grant and Grant 2016; Stryjewski and 
Sorenson 2017). However, the amount of gene tree discordance due to introgression 
in many lineages may be limited by the fact that hybrids often have lower fitness than 
parental types (e.g., Bronson et al. 2003). Two phenomena that might be especially 
important for generating mixtures of gene trees that reflect hybridization are 
(1) hybrid speciation and (2) despeciation. Hybrid speciation reflects cases where 
hybrid populations become reproductively isolated from both parental species. The 
Golden-crowned manakin appears to be an example of such a hybrid species; ~2/3 of 
its genome is more closely related to the Opal-crowned manakin, whereas ~1/3 is 
more closely related to the Snow-capped manakin (Barrera-Guzmán et al. 2018). 
Despeciation refers to the fusion of two related species, resulting in a single species 
and erasing the initial speciation event. Kearns et al. (2018) presented phylogenomic 
evidence that Common ravens arose by fusion of two raven lineages that diverged 
ca. 1.5 Mya and would likely be viewed as species if they were extant (“Holarctic” 
and “California”). Chihuahuan ravens diverged from the California lineage while that 
lineage was isolated from the Holarctic birds. Both of these phenomena lead to a spectrum of 
gene trees that differ from the expectation given ILS alone (see Fig. 3 in Hahn and 
Nakhleh 2016). 
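
For reference, the expectation under ILS alone can be written down for the simplest case. For three species related as ((A,B),C), with an internal branch of length T in coalescent units, the standard MSC result is that the gene tree matching the species tree has probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T). The sketch below simply evaluates these expressions; the function name and the example branch lengths are illustrative assumptions.

```python
import math

def triplet_probabilities(t):
    """Gene tree probabilities for species tree ((A,B),C) under the MSC.

    t is the internal branch length in coalescent units.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

for t in (0.1, 1.0, 3.0):
    probs = triplet_probabilities(t)
    print(t, {k: round(v, 3) for k, v in probs.items()})
```

Hybrid speciation and despeciation are expected to break the symmetry between the two discordant topologies, which is why departures from these values are informative.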

If we focus deeper in evolutionary history, the impact of hybridization is expected 
to be more difficult to examine: gene tree estimation error might make it virtually 
impossible to distinguish the descendants of lineages that underwent limited 
amounts of ancient hybridization from those with purely treelike history (combined 
with ILS). However, the expectation that sex-linked genes will exhibit less introgression 
than autosomal genes (Rheindt and Edwards 2011) might be useful for 
distinguishing conflict due to ILS from conflict due to introgression. Indeed, Fuchs 
et al. (2013) used this to explain a difference between autosomal and Z-linked genes 
they observed in woodpeckers. That study used seven autosomal loci, three Z-linked 
loci, and mitochondrial sequences, so it would be interesting to reexamine it in a 
phylogenomic framework. Another test uses the expectation that the two minority 
topologies for rooted triplets in gene trees will be recovered in equal numbers if ILS 
alone is responsible for discordance; the failure of a collection of true gene 
trees to exhibit this equality would lead one to reject treelike evolution under ILS 
alone. However, errors in estimated gene trees can either produce or obscure 
inequalities in the numbers of gene trees with each minority resolution, limiting 
the utility of the inequality test. To address this, Zwickl et al. (2014) proposed a 
“cumulative support distribution” test that incorporates information about support in 
the gene trees. Developing practical tests that are able to establish whether the null 
hypothesis of ILS alone can explain discordance among gene trees represents a 
major challenge in the phylogenomic era. 
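
The equality expectation for the two minority topologies can be checked with a simple exact binomial test, sketched below under the assumption that the gene trees are independent and estimated without error (the very assumption the text identifies as problematic). This is not the cumulative support distribution test of Zwickl et al. (2014); the function name and the counts are hypothetical.

```python
from math import comb

def minority_topology_test(n1, n2):
    """Two-sided exact binomial test that two minority counts are equal.

    n1 and n2 are the numbers of gene trees showing each minority
    resolution of a rooted triplet; the null hypothesis is p = 0.5.
    """
    n = n1 + n2
    if n == 0:
        return 1.0
    observed = comb(n, n1)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return total / 2 ** n

# Made-up counts: 30 gene trees with one minority topology, 10 with the other.
print(minority_topology_test(30, 10))
print(minority_topology_test(21, 19))
```

A small p-value would argue against ILS alone only if gene tree estimation error can be ruled out as the source of the asymmetry.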

Rare Genomic Changes A fundamental problem for studies focused on discordance 
among gene trees is the indirect nature of the evidence. Unlike evolutionary 
rate heterogeneity and heterotachy, in which it is relatively straightforward to find 
direct evidence for the phenomena, the evidence for ILS and introgression is indirect 
because it depends on the comparison of gene trees. However, phylogenetic trees 
estimated using short sequences, like individual genes, are subject to substantial 
estimation error (e.g., Chojnowski et al. 2008; Gatesy and Springer 2014; Mirarab 
et al. 2014a; Patel et al. 2013) and could be subject to systematic error (e.g., if there is 
strong evolutionary rate variation and/or nonstationary base composition). Thus, 
observing incongruence among estimated gene trees does not provide direct evidence 
of ILS or introgression because that incongruence could reflect error. This is 
true even for loci with limited evolutionary rate and GC-content variation (e.g., 
Fig. 8a); it is certainly true for gene trees for loci with substantial rate and 
GC-content variation (e.g., Fig. 8b). 

Rare genomic changes, which correspond to a heterogeneous set of slowly 
accumulating changes in the genome, provide an alternative means to examine 
ILS and introgression that has the potential to be better than the use of estimated 
gene trees. Ideally, rare genomic changes represent uniquely derived genomic 
characters (i.e., homoplasy-free changes that are subject neither to reversal nor to 
convergence). Genuinely homoplasy-free genomic characters would define a single 
branch in their associated gene tree perfectly. Analyses of avian ILS have used three 
types of rare genomic changes: (1) TE insertions, (2) numt insertions, and (3) indels 
(insertions and deletions) as a whole. TE insertions are the most commonly used 
(e.g., Haddrath and Baker 2012; Suh et al. 2011). Many conflicting TE insertions 
have been identified in birds (Han et al. 2011; Jarvis et al. 2014; Matzke et al. 2012; 
Suh et al. 2011, 2015, 2017); this observed conflict has typically been interpreted 
as direct evidence of ILS. Unfortunately, precise quantification of ILS using TE 
insertions is difficult because (1) the rate of TE insertion is quite variable over time 
(Kapusta and Suh 2017); (2) informative TE insertions are relatively rare even 
when they are scored at a whole-genome scale (Suh 2015); and (3) avian TE 
insertions do not appear to be completely free of true homoplasy (Han et al. 2011). 
Unlike TE insertions, the sole numt study (Liang et al. 2018) did not reveal any 
homoplasy. However, Liang et al. (2018) only identified a small number of informative 
numt insertions and that study included a limited taxon sample. If we expand our 
focus to indels as a whole, which are much more numerous than TE or numt 
insertions, Jarvis et al. (2014) predicted that discordance due to ILS would yield a 
positive relationship between internal branch lengths and the proportion of apparent 
synapomorphies mapping to each of those branches that appear non-homoplastic. 
That exact relationship was observed, although that approach cannot provide a 
quantitative estimate of ILS. Regardless, the observed levels of conflict among 
rare genomic changes indicate there was a relatively large amount of ILS near the 
base of Neoaves. 

Analyses of rare genomic changes could be revolutionized by truly whole- 
genome phylogenetics. Many analyses reported by Jarvis et al. (2014) actually did 
not require whole-genome sequencing. For example, sequence capture could 
(at least in principle) have been used to generate the data for the TENT. In contrast, 
generating rare genomic change data would have been much more difficult (potentially 
even impossible) without genome sequencing. Prior to the phylogenomic era, 
identifying TE insertions required labor-intensive laboratory methods 
(described by Haddrath and Baker 2012). Bioinformatic screens of 
whole-genome assemblies can reveal TE insertions without the need for complex 
laboratory methods. Repetitive sequences, like TEs, can be challenging to assemble, 
but this can be solved by using platinum-quality genome sequences (Kapusta and 
Suh 2017; Weissensteiner and Suh 2018). It is also challenging to target numt 
insertions, but they are straightforward to identify by searching whole-genome 
data. However, because numt insertions are nonfunctional mitochondrial sequences 
in the nuclear genome, they have a relatively high likelihood of being misassembled 
in genomes based exclusively on short-read data (because some numt reads are likely 
to cluster with mitochondrial reads). Thus, platinum-quality genome sequences 
should improve numt scoring relative to most currently available assemblies. Finally, 
the availability of more avian genome assemblies could allow the use of other types 
of rare genomic changes. Microinversions are one of these types of rare genomic 
change that are impossible to identify without sequence data (Braun et al. 2011). 
Microinversions technically include all inversions that cannot be identified cytologically 
(Chaisson et al. 2006). However, the sole avian microinversion study (Braun 
et al. 2011) focused on short (<50 bp) inversions, finding that they accumulate at a 
rate comparable to TE insertions and suggesting that they are likely to be as useful 
for phylogenetics as TE insertions. Large-scale identification of microinversions will 
revolutionize their use and provide another way to examine discordance among gene 
trees. We anticipate that the phylogenomic era will lead to a flood of rare genomic 
change datasets. 

Rare genomic changes are interesting from a computational standpoint because 
the optimal tree for ideal rare genomic changes is the MP tree (Steel and Penny 
2004, 2005). MP is orders of magnitude more computationally efficient than ML 
(or Bayesian) methods (Sanderson and Kim 2000). It is unclear whether the MP tree 
for a collection of ideal rare genomic changes generated on the tree mixture 
generated by the MSC will be the species tree. However, Mendes and Hahn 
(2017) have shown that MP analyses of concatenated data are consistent given the 
MSC (given certain assumptions), providing reasons for optimism if 
one views the criterion of consistency as critical (for detailed arguments against the 
position that consistency is a necessary feature of phylogenetic methods, see Brower 
2018; Sanderson and Kim 2000). Of course, there are two reasons that empirical rare 
genomic change datasets will not be “perfect” (i.e., absolutely homoplasy-free): 
(1) the relevant type of rare genomic change could exhibit some true homoplasy 
and (2) the genome assembly and/or orthology detection pipeline could introduce errors. It may be 
necessary to develop analytical methods that consider those sources of error. Nevertheless, 
it seems reasonable to postulate that rare genomic changes will make it 
possible to estimate the species tree and amount of ILS accurately. 
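
To make the link between rare genomic changes and MP concrete, the sketch below applies Fitch's algorithm to presence/absence characters on a fixed rooted tree and reports the minimum number of changes each character requires. A homoplasy-free character needs exactly one change and therefore marks a single branch; a character that needs more than one change conflicts with the candidate tree, whether because of ILS, introgression, true homoplasy, or scoring error. The tree, the taxon labels, and the characters are hypothetical, and the nested-tuple tree format is an assumption of this example.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, minimum number of changes).

    tree is a leaf name or a (left, right) tuple; states maps leaf -> state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No shared state: one change is required on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1

# Hypothetical five-taxon tree and presence (1) / absence (0) characters.
tree = ((("A", "B"), "C"), ("D", "E"))
clean = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}      # marks the A+B branch
conflict = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0}   # cannot fit with one change

print(fitch_steps(tree, clean)[1])     # 1
print(fitch_steps(tree, conflict)[1])  # 2
```

Summing these step counts over all characters gives the parsimony score of the tree; an MP search would compare that score across candidate trees.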

Assessing Support and Confronting Theory with Phylogenomic Data Despite 
the large body of theoretical work, we do not have a complete explanation for the 
observed results of avian phylogenomic studies. The strongest type of theoretical 
studies, proofs of consistency or inconsistency (e.g., Felsenstein 1978; Hendy and 
Penny 1989; Kim 2000; Matsen and Steel 2007; Roch and Steel 2015; Mendes and 
Hahn 2017), provides information about the behavior of specific analytical methods 
given an infinite amount of data that were generated under a specific model. Kim 
(2000) discussed an elegant geometric interpretation of phylogenetic methods that 
has the interesting corollary that nonparametric bootstrap support for clades should 
increase to 100% as the sample size increases. If the analytical method is consistent 
given the true underlying model of evolution, the correct clade will exhibit 100% 
support given sufficient data; if the data reflect a part of parameter space where the 
method is not consistent, an incorrect clade is expected to exhibit 100% support. 
However, neither Jarvis et al. (2014) nor Prum et al. (2015) observed 100% bootstrap 
support for all clades. This raises the question of how many sites are necessary to 
provide a “sufficient amount” of data to observe the expected asymptotic behavior. 
Many empirical systematists would have predicted that an alignment of 41.8 Mbp 
(the size used to generate the Jarvis et al. 2014 TENT) would be sufficient for all 
nodes to converge on 100% support (recognizing that some nodes might be resolved 
incorrectly due to inconsistency). But there are nine nodes in the TENT that have 
<100% support (and one has only 55% support). Increasing the data matrix size 
more than sevenfold does not eliminate those low-support branches; six nodes with 
<100% support remain present in the Jarvis et al. (2014) whole-genome tree (four of 
those nodes have 62% support). Thus, the limited support for clades at the base of 
Neoaves given available analytical methods appears to reflect an intrinsic property of 
the data rather than a trivial limitation in the amount of data. 
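
The asymptotic behavior described above can be illustrated with a deliberately simple toy model, not a phylogenetic analysis. In the sketch below each simulated site favors topology A, favors topology B, or is uninformative, and the "method" picks whichever topology more sites favor. Sites are resampled with replacement, as in the nonparametric bootstrap, and support is the fraction of replicates that recover A. The site probabilities, the function names, and the random seeds are all assumptions chosen for illustration.

```python
import random

def simulate_sites(n, p_a=0.36, p_b=0.32, rng=None):
    """Draw n sites that favor A, favor B, or are uninformative ("-")."""
    if rng is None:
        rng = random.Random(1)
    weights = [p_a, p_b, 1.0 - p_a - p_b]
    return rng.choices(["A", "B", "-"], weights=weights, k=n)

def bootstrap_support(sites, replicates=100, rng=None):
    """Fraction of bootstrap replicates in which topology A beats B."""
    if rng is None:
        rng = random.Random(0)
    n = len(sites)
    wins = 0
    for _ in range(replicates):
        sample = rng.choices(sites, k=n)  # resample sites with replacement
        if sample.count("A") > sample.count("B"):
            wins += 1
    return wins / replicates

rng = random.Random(42)
for n in (100, 1000, 20000):
    sites = simulate_sites(n, rng=rng)
    print(n, bootstrap_support(sites, rng=rng))
```

With a homogeneous signal like this one, support climbs toward 100% as sites are added; the persistence of low support in genome-scale avian datasets is what makes the empirical result notable.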

The fact that analyses of a very large (even genome-scale) dataset can yield 100% 
bootstrap support for an incorrect clade may seem disturbing. However, it is a natural 
outcome when the analytical method is not consistent. By definition, a consistent 
estimator converges on the true value (in phylogenetics this would be the true tree) 
as the amount of data available for analysis increases (as more of the genome is 
sampled). If an estimator is based on a fundamental misunderstanding of the 
processes that generated the data, then the method is unlikely to be consistent, at 
least in some parts of parameter space. It should not be surprising that analyses using 
such a method could lead to an incorrect conclusion with 100% support; after all, we 
defined the analytical method as one based on fundamentally incorrect assumptions 
regarding the processes that generated the data. This raises a question: are there 
phylogenomic methods that are immune to this issue? There have certainly been 
efforts to create metrics of support for the phylogenomic era. For example, Seo 
(2008) extended the standard bootstrap to a multilocus framework (ASTRAL 
includes an easy-to-use implementation of this multilocus bootstrap). The general 
idea of subsampling genes has also attracted attention (Edwards 2016); 
bootstrapping is often used to assess support in locus subsampling studies. There 
are other methods like concordance factors (Ané et al. 2007; Baum 2007) and the 
information theoretic measures of Salichos et al. (2014), which instead focus on 
examining the agreement among gene trees. The latter is especially interesting 
because it has been pointed out that it can also be applied to rare genomic changes. It is 
important to recognize that measures of agreement among gene trees do not provide 
information regarding the support for clades in the species tree. If data fit the MSC, 
one may still have a high degree of confidence regarding the presence of a specific 
clade in the species tree even when a relatively small proportion of gene trees agree 
with that clade (Pamilo and Nei 1988). If enough gene trees are sampled, multilocus 
bootstrapping (Seo 2008) and local posterior probabilities (Sayyari and Mirarab 
2016) can both yield strong support for clades that only agree with a plurality of 
gene trees. However, all methods exhibit the same behavior as the standard bootstrap 
when the analytical method is inconsistent (i.e., incorrect clades will receive 100% 
support if sufficient data are analyzed and those data reflect a part of parameter space 
where the method is inconsistent). But all is not lost; many trees are both robust to 
the analytical method (e.g., Rindal and Brower 2011) and very likely to be correct. In 
fact, the story of avian phylogenomics is arguably one of steady progress, with parts 
of the tree that would have been viewed as hopeless a decade ago now emerging as 
“solved” in a satisfying manner (e.g., Fig. 2). It is the parts of the tree that remain in 
flux despite the availability of very large amounts of data that represent a problem: 
we must either conclude that the 41.8 Mbp analyzed by Jarvis et al. (2014) is not 
sufficient to observe the expected asymptotic behavior (i.e., 100% support at all 
nodes) or that there is something about the data and available methods of phylogenetic 
analysis that we do not understand. 

There are four basic hypotheses that can explain the limited support for major 
groups at the base of Neoaves in multiple phylogenomic analyses (Fig. 2). First, 
errors in the data matrices (e.g., assembly, orthology assignment, and alignment) 
could introduce noise. If this hypothesis is correct, the limited support should 
disappear as the quality of the aligned dataset is improved, either by removing 
problematic regions (e.g., using alignment-filtering methods like DivA; Mendoza 
et al. 2014) or by extracting data from improved genome assemblies that lead to less 
downstream error. Second, the poor support could reflect limitations of the available 
computational implementations of methods. If the bird data lie very close to 
“boundaries,” even minor differences in numerical optimization during the 
calculation of the likelihood might lead the methods to choose different trees in different 
bootstrap replicates. However, this would require all of the very large Jarvis et al. 
(2014) datasets to be in parts of parameter space where those computational issues 
manifest themselves. Third, it could be the case that the heterogeneous nature of the 
data tends to obscure phylogenetic signal. If this hypothesis is correct, it might be 
possible to improve estimates of the avian tree by focusing on less noisy parts of the 
genome. Reddy et al. (2017) presented several arguments that noncoding data, such 
as introns and UCEs, might perform better in phylogenetic analyses than coding 
exons. The strongest empirical argument favoring noncoding data was the fact that 
trees based on rare genomic changes (one for TE insertions and one for all indels) are 
more similar to the intron and UCE trees. Nonetheless, it would clearly be more 
convincing to identify additional data types that yield trees congruent with the 
noncoding trees. Finally, as described above, the base of Neoaves could represent 
a hard polytomy (Suh 2016). If Suh (2016) is correct, then individual gene trees will 
be random for the lineages involved in the hard polytomy. The hard polytomy 
hypothesis predicts that any relatively small set of gene trees are unlikely to be 
very congruent. Suh (2016) proposed a nine-taxon hard polytomy, and there are 
>135,000 possible rooted nine-taxon trees. However, Reddy et al. (2017) found 
that analyses of a largely intronic 54-locus dataset yielded a tree similar to the Jarvis 
et al. (2014) intron tree (and the TENT, which was 68% noncoding data). Reddy 
et al. (2017) also observed that analyses of a largely coding 104-locus dataset that 
did not overlap with any loci in Jarvis et al. (2014) resulted in a tree similar to the 
Jarvis et al. (2014) exon tree. Those results seem unlikely if the low support at the 
base of Neoaves reflects a hard polytomy; if the hard polytomy hypothesis is correct, 
it would require that analyses of data reflecting relatively small sets of gene trees 
coincidentally converge on the two specific parts of tree space that were identified by 
Jarvis et al. (2014), and that they do so in a manner that is correlated with data type. We suggest that 
understanding the heterogeneity present in genomic datasets will ultimately be 
necessary to obtain a well-supported estimate of the avian tree of life. 
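
The size of tree space for a nine-lineage polytomy can be checked directly with the standard double-factorial formulas: (2n - 5)!! unrooted and (2n - 3)!! rooted binary topologies for n labeled taxa. The sketch below is a minimal implementation with hypothetical function names. For nine taxa it gives 135,135 unrooted and 2,027,025 rooted topologies, both consistent with the ">135,000" figure quoted above.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted_trees(n):
    """Number of unrooted binary topologies for n >= 3 labeled taxa."""
    return double_factorial(2 * n - 5)

def num_rooted_trees(n):
    """Number of rooted binary topologies for n >= 2 labeled taxa."""
    return double_factorial(2 * n - 3)

print(num_unrooted_trees(9))  # 135135
print(num_rooted_trees(9))    # 2027025
```

With this many alternatives, two independent small datasets landing on similar resolutions by chance would be very unlikely, which is the point of the argument above.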

The Impact of Genome Assembly Quality in the Phylogenomic Era All of the 
issues discussed above focus on the behavior of analytical methods, raising an 
important question: what is the role for platinum-quality genome assemblies in 
avian phylogenomics? After all, draft genome assemblies typically capture at least 
90% of most bird genome sequences, even when they reflect very low-coverage 
sequencing (e.g., Tiley et al. 2018). The increased amount of data available in 
platinum-quality genome assemblies is unlikely to have much direct impact on 
phylogenomic analyses. However, high-quality genome assemblies are likely to 
improve dataset quality. Springer and Gatesy (2018) reported that many alignments 
analyzed by Jarvis et al. (2014) had homology errors (e.g., the inadvertent alignment 
of exons to introns due to incorrect gene annotation). The existence of problematic 
alignments in a phylogenomic dataset is not unexpected; any computational pipeline 
used to generate a phylogenomic data matrix will yield both false positives (i.e., it 
will align some nonhomologous sequences) and false negatives (i.e., it will fail to 
align some truly homologous sequences). The greater contiguity and smaller number 
of misassemblies in platinum-quality genome assemblies could make it easier to 
extract orthologous sequences. Improved genome annotations will also make it 
easier to extract orthologous regions. Functional data, such as RNA-seq, can be 
very important for genome annotation (Roberts et al. 2011), and it is accumulating at 
a rapid pace (e.g., Seki et al. 2017). Iso-seq data have the potential to be even more 
helpful because they are generated using long-read sequencing technologies that yield full-length 
mRNA sequences (Gonzalez-Garay 2016), unlike RNA-seq data that reflect short 
reads that do not define complete transcripts in a direct manner. Both Iso-seq data 
and platinum-quality genome assemblies are now being generated for birds (e.g., 
Korlach et al. 2017; Workman et al. 2017). Better annotation will also highlight 
changes in gene structure (e.g., precise intron deletions; Coulombe-Huntington and 
Majewski 2007); this will provide another type of rare genomic change (Bleidorn 
2017). Finally, better genome assemblies will also permit better analyses of genome 
structure and content (e.g., inversions, rearrangements, gene losses, and gene 
duplications). Methods to use gene order information for phylogenetic estimation 
already exist (Hu et al. 2014), and all that is needed are genome assemblies of 
sufficient quality. Ultimately, resolving the most difficult questions in the avian tree 
of life is likely to require improved genome assembly and annotation, the extraction 
of multiple data types, and improved analytical methods. 

4 Progress Toward Species-Level Avian Megaphylogenies 

Resolving the bird tree of life actually involves two related but somewhat distinct 
research programs: (1) the resolution of difficult nodes, like the base of Neoaves, and 
(2) the generation of large-scale trees that include all bird species. The latter is 
complicated by evidence that many avian taxa should be considered species but are 
not currently recognized as such (Gill 2014; Barrowclough et al. 2016). The decision 
to assign the rank of species to taxa depends on the species concept one chooses to 
employ (see Ottenburghs 2019 for a recent review of species concepts). Nonetheless, 
it is likely that the number of evolutionary entities (regardless of whether or not those 
sometimes-cryptic taxa are assigned the rank of species) will ultimately increase 
from the ~10,000 bird species that are recognized in most current taxonomies 
(Table 3) by at least twofold (and possibly even increase threefold). The Brown 
et al. (2017) megaphylogeny is the only existing bird tree with more than 10,000 
taxa, although it is unclear how many of those taxa will ultimately be assigned a 
status equivalent to species upon detailed study. Nevertheless, generating large-scale 
avian trees (megaphylogenies or simply “big trees”) that include most or all of the 
currently recognized avian species would represent an excellent starting point for the 
next phase of avian phylogenomics. 

The available avian megaphylogenies (Table 4) have two major limitations: 
(1) most fail to include all named bird species and (2) none of them reflect analyses 
of phylogenomic-scale data. The reason that many megaphylogenies exclude 
species is simple: molecular data are absent for a number of species. The complete 
megaphylogenies (Brown et al. 2017; Jetz et al. 2012) incorporate data-deficient 
species using taxonomic information. Two megaphylogenies (Hedges et al. 2015; 
Jetz et al. 2012) also used strong backbone constraints (i.e., a number of 
relationships were fixed). Other big trees (Brown et al. 2017; Davis and Page 
2014) lack meaningful branch lengths because they reflect the synthesis of published 
trees rather than a direct analysis of any molecular (or morphological) data. Burleigh 
et al. (2015) is the only avian megaphylogeny generated using a purely empirical 
approach; it reflects an unconstrained ML analysis of orthologous avian sequences 
published before June 2011. The newest megaphylogeny (Brown et al. 2017) reflects 
a synthesis of published trees (i.e., a supertree) so it does incorporate phylogenomic 
data in the form of source trees generated using phylogenomic data; in fact, it has a 
backbone identical to the Prum et al. (2015) tree (Fig. 2b). 

Table 4 Avian megaphylogenies generated by synthesizing data from multiple sources 

Study                    Number of           Analytical approach a            Method b   Branch 
                         neornithine taxa                                                lengths c 
Jetz et al. (2012)       6670/9993 d         Supermatrix (15 regions)         BI         time 
Davis and Page (2014)    5379                Supertree analysis               MRP        n/a 
Burleigh et al. (2015)   6714                Supermatrix (29 regions)         ML         mol 
Hedges et al. (2015)     7163                Synthesis of divergence times    TTOL       time 
Brown et al. (2017)      13,579 e            Supertree analysis               RH         n/a 

a Analytical approaches are supermatrix (analysis of concatenated sequence data), supertree 
(combination of published trees), and synthesis of divergence times (refining a taxonomy using 
published timetree data) 
b BI Bayesian inference (with constraints in the case of Jetz et al. 2012), ML maximum likelihood, 
MRP matrix representation with parsimony (Baum 1992; Ragan 1992), RH Redelings and Holder 
(2017) supertree pipeline, TTOL “time tree of life” approach (Hedges et al. 2015) 
c Branch lengths available as estimates of absolute time or molecular change (substitutions per site), 
“n/a” indicates that branch lengths are not available 
d Two Jetz et al. (2012) trees are available. One comprises 6670 taxa for which at least some 
sequence data were available. The other comprises 9993 taxa, and it includes taxa for which no data 
are available; taxa without any associated sequence data were placed using taxonomic information 
e The Brown et al. (2017) tree includes taxa that are not assigned the rank of species in current 
checklists 

The value of phylogenomic data, especially whole-genome data, for the generation 
of a species-rich bird tree is actually an open question, especially if we choose to 
include more than 10,000 terminal evolutionary lineages. The additional information 
available in whole-genome sequences may not justify data collection costs or the 
computational burden of assembling, annotating, and analyzing whole-genome data 
when we are near the tips of the tree. Sequence capture methods (e.g., Table 2) 
sidestep problems associated with homology assignment to a fairly large degree 
because they focus on assembling relatively short contiguous regions; there is no 
need to annotate the genome and extract the relevant data types. Moreover, it is 
usually possible to recover complete or virtually complete mitochondrial genome 
sequences even if mitochondrial sequences are not targeted by probes (Meiklejohn
et al. 2014; Raposo do Amaral et al. 2015; Wang et al. 2017); alternatively, 
low-coverage genome sequencing can yield mitochondrial genome sequences 
(Barker et al. 2015). Mitochondrial sequences are likely to be especially valuable 
near the tips of trees (Barrowclough and Zink 2009). At least in the near term, it 
seems likely that sequence capture will contribute substantial amounts of data to 
avian megaphylogenies. 

The data included in current megaphylogenies are heterogeneous in quality, and
this has no doubt led to topological (and, when they are available, branch-length) 
errors in those trees. The impact of those errors on downstream analyses is unclear, 
and it probably depends on the specific analyses that are conducted. Indeed, in a 
study focused on a single family (woodpeckers; about 200 species), Dufort (2016) 
found at least 28 sequences for 10 different species in Jetz et al. (2012) that have 
been (or could possibly be) assigned to other species. Burleigh et al. (2015) fared a 
little better, including only 14 problematic sequences for 5 species. These errors 
largely appear to reflect difficulties associated with reconciling the NCBI taxonomy 
with current species limits, although Dufort (2016) also noted that both big trees 
included a cytochrome b sequence that Fuchs et al. (2008) deemed a pseudogene. 
Although these findings should raise concerns, the more important question is 
whether the phylogenetic errors caused by these issues influence downstream
analyses. Wang et al. (2016) provided empirical evidence that errors can influence 
downstream analyses (in this case biogeographic inference) by showing that a single 
misplaced species in the Jetz et al. (2012) megaphylogeny had a large impact on their 
conclusions. The practice of placing data-deficient taxa using taxonomic information,
which has been criticized on theoretical grounds (Rabosky 2015; Title and
Rabosky 2017), is potentially a bigger problem. Some clades in Jetz et al. (2012)
have posterior probabilities of 1.0 yet depend on the position of taxa with no data,
and they conflict with clades that received 100% bootstrap support in an ML analysis of
eight nuclear loci and three mitochondrial regions (Hosner et al. 2015a). These
problems are not unique to Jetz et al. (2012); even those megaphylogenies that 
eschew the use of data-deficient taxa suffer from problems inherent to the synthesis 
of diverse data sources. The only solution is the collection of much larger datasets 
from all bird species; there are ongoing efforts, like the B10K project (Zhang et al. 
2015) and the OpenWings project (Pennisi 2018), that aim to produce the requisite 
sequence data. 

The exercise of generating an accurate, taxon-rich phylogeny of birds is not an
end in itself. Rather, it is a necessary component of comparative studies that
address evolutionary pattern and process. Trees allow us to ask about the evolutionary
opportunities, constraints, and processes that led to the biodiversity we now
observe. Comparative methods reveal whether observed patterns in biodiversity data 
or correlations among traits require a deeper explanation or have the potential to 
be explained by simple null hypotheses. Studies using these methods have revealed 
patterns in biogeography (e.g., Claramunt and Cracraft 2015; Field and Hsiang 
2018; Moyle et al. 2016; Wang et al. 2016), rates of diversification (e.g., Ricklefs 
2007; Jetz et al. 2012), and many other types of phenotypic changes (e.g., Cooney 
et al. 2017; Hosner et al. 2017; Field et al. 2018; Stoddard et al. 2017). Phylogenetic 
trees can inform conservation priorities (e.g., Diniz-Filho et al. 2013; Jetz et al. 
2014). Trees are also a necessary component of analyses that range from those 
focused on examining sequence conservation and the genomic landscape (e.g., 
Botero-Castro et al. 2017; Zhang et al. 2015) to the relationship between whole- 
organism traits and patterns of molecular change, such as evolutionary rates, amino 
acid substitutions, and base composition (e.g., Zhang et al. 2014; Berv and Field 
2018; Weber et al. 2014a, b). Phylogenomic studies have proven to be the most 
revealing, surprising, and reliable of all sources of phylogenetic information. This 
assertion may seem surprising given the conflicts among phylogenomic studies that 
we have highlighted (Fig. 2), but phylogenomic trees exhibit substantially more 
topological similarities than trees generated before the phylogenomic era. Indeed, it 
has been only by virtue of conflicts that we have better come to understand the nature 
and complexity of the evolutionary and historical processes that appear to have 
misled earlier studies, both molecular (e.g., heterotachy, nonstationarity, ILS, and 
hybridization) and morphological (e.g., convergent evolution; cf. Mayr 2008; 
Sackton et al. 2018). 
5 Where Do We Go From Here? The Future of Avian
Phylogenomics

We anticipate many challenges as the field of phylogenetics moves from analyses of 
“phylogenomic-scale” datasets, like those generated by sequence capture methods 
(Table 2), to analyses of truly whole-genome scale. The Jarvis et al. (2014) analyses 
of 48 bird genomes required more than 400 years of CPU time; 42 of those CPU 
years were devoted to the 322-Mbp whole-genome tree. Obviously, using the same 
methods of analysis for the ~10,000 bird species recognized at this time will be
impractical. Filtering genome alignments to focus on the type(s) of data that are most
likely to provide an accurate estimate of evolutionary history may prove to be
necessary; thus, further exploration of data-type effects will no doubt be helpful.
Of course, the assertion that a specific data type yields a topology close to the true
tree is a hypothesis; finding ways to corroborate specific hypotheses regarding data-type
effects represents a major challenge for the field of phylogenomics. In principle,
improved models of sequence evolution could address data-type effects, but efforts 
to develop complex (and presumably very computationally-demanding) models may 
actually take us in the wrong direction. We argued that rare genomic changes might 
be an extremely valuable source of information; rare genomic change data could 
provide another solution to the computational problem because it might be possible 
to use MP for their analyses without sacrificing accuracy. Moreover, the use of 
rare genomic change data intrinsically reduces whole-genome alignments to much 
smaller and, therefore, easier-to-analyze data matrices. This shifts the computational 
challenges away from the tree search and toward the identification of rare genomic 
changes; those computational challenges are likely to be further ameliorated by the 
availability of platinum-quality genome assemblies with high-quality gene annotations. 
Obviously, there may be other solutions to the computational challenges associated 
with phylogenetic analysis of complete avian genomes, but the fact that it is possible 
to envision some practical ways forward makes us sanguine that the challenges can 
be solved. 

In summary, it is clear that the avian tree of life will grow substantially over the 
next few years, and we can expect that nearly all named taxa will eventually be 
included. As we alluded to in our discussion of megaphylogenies, there is now a 
major impetus among systematists globally to identify more and more “cryptic taxa” 
(whatever their rank). This might easily swell avian taxonomic diversity from the
~10,000 species that are currently recognized to more than 30,000 evolutionary
entities. Investigators of avian diversity and biology will want all of these taxa to find 
a home on the avian tree. Throughout this review, we have generally been agnostic 
regarding the direction that analytic methods will take in the future, simply describing
the results of various analyses and, in some cases, the strengths and weaknesses
of those methods. However, we believe any comprehensive avian tree should 
ultimately reflect the results of an empirical approach that links the data and the 
tree in a direct manner, such as the global supermatrix approach akin to the Burleigh 
et al. (2015) study. Ideally, that tree would incorporate data from rare genomic 
changes as well as sequence information. This will present huge computational 
challenges, and it is unclear, for reasons we articulated above, how the community
will meet these challenges. Increasing incorporation of more taxa and more 
characters has always called for shifts in analytical thinking, and these shifts will 
no doubt continue to happen. Important questions in avian biology will be asked and 
answered using many different kinds and amounts of data. Some of those questions 
will require genome-scale data; some will even require reference quality genome 
assemblies. However, it will remain possible to address many innovative questions 
without genome-scale data. With that in mind, it is important that ornithologists 
focus on developing and framing the innovative questions in avian biology; there is 
little doubt that genome-scale data for birds will become broadly available and be 
combined with dense taxon sampling. We expect that genome-scale data will make it 
possible to answer new questions and push the frontiers of avian biology forward 
over the coming years. 
Abstract

What is a species? This seemingly simple question has occupied the minds of 
numerous biologists and philosophers, resulting in the formulation of many 
species concepts. From a theoretical point of view, the species problem has 
been resolved by equating species with independently evolving lineages 
(i.e. the evolutionary species concept or the general lineage concept). However, 
the practical issues with describing and delineating species remain. The origin of 
species is a gradual process that typically requires thousands to millions of years, 
creating a grey zone of species delimitation in which taxonomy is often controversial.
To account for this, an integrative taxonomy has been proposed in which
different taxonomic concepts and methods are integrated in the delimitation of 
species. In this chapter, I argue that genomics provides another line of evidence in 
this pluralistic approach to species classification. Indeed, genomic data can be 
combined with classical species criteria, such as diagnosability, phylogeny and 
reproductive isolation. First, genomic data can provide an extra diagnostic feature 
in species delimitation. Compared to ‘old-school’ genetic markers, the use of 
genome-wide markers leads to a significant rise in statistical power. Second, 
phylogenomic analyses can resolve the evolutionary relationships within rapidly 
diverging or hybridizing groups of species while taking into account gene tree 
discordance. Third, genomic data can be used to pinpoint the genetic basis of 
reproductive isolation and provide a detailed description of the speciation process.
All in all, the genomic era will supply avian taxonomists with a new tool box
that can be applied to old concepts, leading to better informed decisions in 
cataloguing biodiversity. 
1 Introduction

The debate on the definition of a species, commonly known as the species problem 
(Richards 2010), has produced an insurmountable quantity of literature. Yet, 
Darwin’s (1859, p. 44) description still holds today: ‘No one definition has yet 
satisfied all naturalists; yet every naturalist knows vaguely what he means when he 
speaks of a species’. The debate continues because of two main difficulties: (1) the 
confusion between species concepts and species criteria and (2) the continuous 
nature of species. 

The first difficulty highlights the fact that the species problem consists of two 
major issues: the theoretical question of what species are (i.e. species concepts) and 
the ways in which species can be delimited in practice (i.e. species criteria). It is 
important to realize that species concepts are not species criteria (Hey 2006). Many 
taxonomists have accepted the view that species are lineages (Mayden 1997; De 
Queiroz 1998, 1999), thereby resolving the first theoretical issue. In contrast to this 
theoretical triumph, the practical aspect of the species question is still an active field 
of research. There are many ways in which species can be described, including 
morphological, ecological and genetic approaches. How should these different lines 
of evidence for the status of particular taxa be interpreted? And what is the role of 
genomics in species discovery and delimitation? 

The second difficulty, the continuous nature of species, is a direct consequence of 
the gradual process of speciation. The origin of new species typically requires 
thousands to millions of years, creating a grey zone of species delimitation in 
which taxonomy is often controversial (Roux et al. 2016). In this grey zone, different 
species criteria might result in diverse decisions on the assignment of species ranks. 
The developing field of speciation genomics might provide solutions for taxonomists 
dealing with the continuous nature of species (Seehausen et al. 2014; Campbell et al. 
2018). 

In this chapter, I will explore the role of genomics in the debate of species 
concepts and species criteria. First, I will show how others have made substantial 
progress on these issues by adopting theoretical monism—species as lineages—and 
practical pluralism, paving the way for an integrative taxonomy (Padial et al. 2010). 
Then, I will introduce the most commonly used species criteria in avian taxonomy 
(distinguishability, reproductive isolation and monophyly) and how they can be 
integrated in species delimitation (Sangster 2014). Finally, I will show how genomic 
data can be incorporated into taxonomic decisions. Some authors have proposed a 
‘genomic species concept’ (e.g. Jarvis 2016). I will argue that there is no need for 
another species concept; rather, genomic data can be incorporated in the existing
species criteria. 


2 Species Concepts Are Not Species Criteria 

During the twentieth century, there has been a proliferation of species concepts. 
Mayden (1997) lists no fewer than 22 distinct species concepts, including some
familiar ones such as the biological species concept (Mayr and Ashlock 1991), the 
phylogenetic species concept (Cracraft 1983) and the recognition species concept 
(Paterson 1993). Despite this plethora of species concepts, the so-called silver bullet 
species concept, one that is universally applicable, was not achieved. A universal 
species concept could not be formulated because of two main issues: (1) the pluralistic
nature of species and (2) the tension between conceptualization and delimitation
(Hey 2006). First, the proliferation of species concepts is a direct consequence of the 
diversity of life: different taxonomic groups require different species concepts 
depending on their particular characteristics (Mayden 1997). For instance, the 
biological species concept, which stresses reproductive isolation between members 
of different species, cannot be applied to asexually reproducing organisms. Second, 
the issue of species conceptualization is often confused with the issue of species 
delimitation (Mayden 1999): concepts are theories or ideas that are general and may 
or may not be based on empirical observations, while species delimitation requires a
prescribed set of repeatable operations that determine whether a certain
group of individuals represents a species or not. Hey (2006) summarized this issue
nicely: ‘As scientists we should not confuse our criteria for detecting species with 
our theoretical understanding of the way species exist. Detection protocols are not 
concepts’. 


2.1 Theoretical Monism 

To resolve these issues, Mayden (1997) proposed a hierarchy of species concepts, 
with a primary theoretical species concept and several secondary operational species 
concepts. He argued that only the evolutionary species concept is suitable as primary 
concept, because it is theoretically robust and generally applicable. This concept 
states that a species is ‘an entity composed of organisms which maintains its identity 
from other such entities through time and over space, and which has its own 
independent evolutionary fate and historical tendencies’ (Wiley and Mayden 
2000). The remaining concepts are secondary, functioning as guidelines that are 
essential for the study of species in practice (Wiens and Servedio 2000; Sites and 
Marshall 2004). Together, the primary and secondary species concepts form a 
hierarchical system. 

Similarly, De Queiroz (1998, 1999) reviewed several existing species concepts 
and argued that all existing species concepts are variants of a single general concept, 
which he dubbed ‘the general lineage concept’. Species are considered separately 






J. Ottenburghs 


Fig. 1 This simplified diagram represents a single lineage splitting into two
independently evolving lineages (or species). The horizontal lines represent the times
at which the lineages acquire different properties (e.g. they become phenetically
distinguishable, reproductively isolated, reciprocally monophyletic, etc.). This set of
properties (SC species concept) coincides with a grey zone in which alternative species
concepts come into conflict. On either side of the grey zone, there is agreement on the
number of species. Adapted from De Queiroz (2007)
evolving metapopulation lineages. A lineage indicates ‘an ancestor-descendant
series’, and metapopulation refers to an ‘inclusive population made up of connected 
subpopulations’ (De Queiroz 2007). The difference between the existing species 
concepts lies in the specific criterion that they emphasize, such as reproductive 
isolation (Mayr and Ashlock 1991), systems of mate recognition (Paterson 1993) 
or diagnosability (Cracraft 1983). These criteria should not be regarded as necessary 
for species delimitation, but rather as contingent upon the evolutionary history of the 
taxa. Moreover, during the speciation process, there will be a grey zone in which 
different species criteria come into conflict (Fig. 1). Here, a ‘life history approach’ is 
warranted, in which different species properties correspond to different stages in the 
life history (i.e. speciation process) of a species (Harrison 1998). It is important to 
keep in mind that the order in which species properties arise is contingent upon the 
speciation process. In some cases, morphological differentiation evolves first, 
followed by reproductive isolation, while, in other cases, it might be the other way 
around. By acknowledging the continuous nature of speciation and realizing that 
consequently species criteria might lead to conflicting outcomes in this grey zone, 
taxonomists can be better informed in decisions on the number of species. 

So, both De Queiroz (1998, 1999) and Mayden (1997) reached a similar conclusion,
albeit using a different approach (Hey 2006; Naomi 2011). The species
problem can thus be partly resolved by theoretical monism (the evolutionary species 
concept or general lineage concept) in combination with practical pluralism, in 
which different species criteria correspond to different stages during the speciation
process. 


2.2 Practical Pluralism in Avian Taxonomy 

The view that species are (segments of) lineages emphasizes the unity of various 
rival approaches to species delimitation and recognizes that taxonomy is necessarily 
pluralistic because species have different characteristics. Sangster (2014) examined 
taxonomic bird studies published between 1950 and 2009. He compared the application
of six species criteria in the description of new species, proposals to change
the taxonomic rank of species and subspecies and the taxonomic recommendations
of the American Ornithologists’ Union Committee on Classification and Nomenclature.
The six criteria can be divided into three classes. They are based on several
species concepts and are not viewed as necessary for the delimitation of species. I 
will briefly introduce these criteria and the species concepts they relate to 
(De Queiroz 1998, 1999; Sangster 2009). 

1. Distinguishability 

1a. Diagnosability

If a taxon has a unique, fixed character or a unique combination of character 
states, it can be considered a separate species. This criterion is based on the 
diagnostic approach (Baum and Donoghue 1995) of the phylogenetic species 
concept (Cracraft 1983). Nixon and Wheeler (1990) describe diagnosability as 
follows: ‘the smallest aggregation of populations (sexual) or lineages (asexual) 
diagnosable by a unique combination of character states in comparable 
individuals’. 

1b. Degree of difference

Some authors argue that taxa should be recognized as distinct species if they 
differ ‘too much’ from other related taxa. The differences may refer to morphological,
biometric, behavioural or genetic characters. Hence, the degree of difference
criterion can be traced back to several species concepts, such as the
recognition species concept (Paterson 1993), the phenetic species concept 
(Sokal and Sneath 1963) or the genetic cluster concept (Mallet 1995). 

2. Phylogeny 

2a. Monophyly 

The criterion that taxa should be separated to prevent the recognition of a 
paraphyletic species taxon is based on the monophyletic version (de Queiroz and 
Donoghue 1988) of the phylogenetic species concept (Cracraft 1983). 

2b. Exclusive coalescence 

Baum and Shaw (1995) provided a genealogical perspective to the species 
problem and argued that a species is ‘a group of organisms [that] is exclusive if 
their genes coalesce more recently within the group than between any member of 
the group’. This species criterion is often considered a special version of the 
phylogenetic species concept (Baum and Donoghue 1995; Luckow 1995). 
3. Cohesion

3a. Adaptive zone 

The adaptive zone criterion is based on ecological differences or the occupation
of different niches, emphasizing the importance of ecologically based natural
selection in the maintenance of species. This criterion is related to the ecological 
species concept which states that ‘a species is a lineage (or a closely related set of 
lineages) which occupies an adaptive zone minimally different from that of any 
other lineage in its range and which evolves separately from all lineages outside 
its range’ (Van Valen 1976). 

3b. Reproductive isolation 

This criterion can take many forms and depends on the nature of reproductive 
isolation. A distinction is made between pre- and postzygotic isolation 
mechanisms (Coyne and Orr 2004): prezygotic isolation mechanisms act before 
fertilization, whereas postzygotic isolation mechanisms act after fertilization and 
can be either intrinsic or extrinsic. Intrinsic postzygotic isolation mechanisms 
lead to sterility or inviability of the offspring, while extrinsic postzygotic isolation 
mechanisms encompass lower fitness of the offspring for ecological or 
behavioural reasons, not developmental defects. Depending on the specific reproductive
isolation mechanism, this criterion can be traced back to several concepts,
such as the recognition species concept (Paterson 1993), the biological species 
concept (Mayr and Ashlock 1991) or the cohesion species concept (Templeton 
1989). 
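
The monophyly criterion (2a) in the list above lends itself to a short illustration. The sketch below tests whether a set of samples forms a clade on a rooted tree. The nested-tuple tree, the function names, and the sample labels are hypothetical, and a real analysis would work from inferred gene or species trees with support values.

```python
def leaves(tree):
    """Return the set of leaf names in a rooted nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its leaf set."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree: samples a1 and a2 are currently classified as taxon A,
# but sample b1 (taxon B) is nested among them.
tree = (("a1", ("a2", "b1")), "c1")

taxon_a_is_clade = is_monophyletic(tree, {"a1", "a2"})
a_plus_b_is_clade = is_monophyletic(tree, {"a1", "a2", "b1"})
print(taxon_a_is_clade, a_plus_b_is_clade)
```

In this toy tree, taxon A on its own is not a clade, whereas A together with B is. Under the monophyly criterion, that pattern would argue for revising the limits of taxon A.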

The analysis of Sangster (2014) showed that the most frequently used species 
criterion in avian taxonomy is diagnosability, followed by reproductive isolation and 
degree of difference. Furthermore, reproductive isolation is less frequently applied in 
studies of allopatric taxa compared to sympatric or parapatric taxa, suggesting that 
avian taxonomists are reluctant to speculate on the degree of reproductive isolation 
between allopatric populations. One approach to species delimitation in allopatric
populations is the so-called Tobias criteria, in which the divergence in a
sample of sympatric species is used to calibrate thresholds for species status in 
parapatric or allopatric taxa (Tobias et al. 2010). This procedure was formally 
introduced by Mayr (1969) and later recast in a more quantitative framework by 
Isler et al. (1998), who provided guidelines for species delimitation in antbirds 
(family Thamnophilidae) based on song divergence in several sympatric species 
pairs. Tobias et al. (2010) expanded this procedure to more bird families and argued 
that it ‘can be applied to the global avifauna to deliver taxonomic decisions with a 
high level of objectivity, consistency and transparency’. Nevertheless, this system 
has been criticized, and there is certainly room for improvement (e.g. Donegan 
2018). 
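
The calibration logic described above can be sketched in a few lines. This is a deliberately simplified illustration and not the published scoring system: the divergence scores are invented, and using the smallest divergence among sympatric species pairs as the threshold is an assumption made here for clarity.

```python
def calibrate_threshold(sympatric_divergences):
    """Return the smallest divergence observed among accepted sympatric species pairs."""
    if not sympatric_divergences:
        raise ValueError("need at least one sympatric species pair")
    return min(sympatric_divergences)


def meets_threshold(divergence, threshold):
    """True if an allopatric pair is at least as divergent as the calibrated threshold."""
    return divergence >= threshold


# Hypothetical divergence scores (arbitrary units) for sympatric species pairs.
sympatric_pairs = [9, 12, 7, 15]
threshold = calibrate_threshold(sympatric_pairs)

# Hypothetical allopatric pairs whose rank is under review.
allopatric_pairs = {"pair_1": 10, "pair_2": 4}
decisions = {
    name: meets_threshold(score, threshold)
    for name, score in allopatric_pairs.items()
}
print(threshold, decisions)
```

The choice of calibration statistic (a minimum, a mean, or a lower percentile) strongly affects the outcome, which is one reason such thresholds remain open to debate.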

More importantly, however, the analysis of Sangster (2014) revealed that nearly 
half of the studies (46.5%) applied multiple criteria in species delimitation, 
supporting the notion that avian taxonomy is pluralistic. This pluralism is in line 
with propositions for an integrative taxonomy (Dayrat 2005; Padial et al. 2010) in 
which different taxonomic concepts and methods are integrated in the delimitation 
and description of species. Within this context, two general frameworks are being
advocated: ‘integration by congruence’ and ‘integration by cumulation’ (Padial et al.
2010).

The congruence approach to species delimitation entails that different datasets, 
such as molecular and morphological characters, support the decision to recognize 
certain taxa as valid and distinct species (Dayrat 2005; DeSalle et al. 2005). For 
example, Alström et al. (2008) used congruence between plumage, biometrics, egg
coloration, song, mitochondrial DNA and distribution to draw species limits in the 
Bradypterus thoracicus complex. The main advantage of this approach is that most 
taxonomists will agree on the validity of a species supported by several independent 
datasets, leading to taxonomic stability. The limitation of integration by congruence 
is the risk of underestimating species numbers, because speciation does not always 
involve character change at all levels (Adams et al. 2009), specifically in adaptive 
radiations where ecological divergence and reproductive isolation develop at different
rates (Seehausen 2004; Shaffer and Thomson 2007).

The alternative framework, integration by cumulation, is based on the assumption 
that divergence in any of the datasets can be used as evidence for the delimitation of 
species. Congruence is desired but not necessary (De Queiroz 2007). In practice, 
evidence from different datasets is cumulated, concordances and discordances are 
explained within the specific evolutionary context of the taxa under study, and based 
on the available evidence a decision is made. An advantage of this approach is that 
species delimitation is not restricted by a particular biological property. The main 
disadvantage is that the uncritical use of a single line of evidence, such as mtDNA, 
can lead to an overestimation of species numbers (Agapow et al. 2004; Isaac et al. 
2004; Meiri and Mace 2007). 
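
The contrast between the two frameworks can be reduced to a toy decision rule. In the sketch below, each dataset is coded simply as supporting or not supporting a split. The dataset names and values are hypothetical, and treating cumulation as a bare "any dataset suffices" rule is the most permissive reading, since in practice discordances are interpreted in their evolutionary context before a decision is made.

```python
def integration_by_congruence(evidence):
    """Recognize a species only if every dataset supports the split."""
    return all(evidence.values())


def integration_by_cumulation(evidence):
    """Treat divergence in any single dataset as potential evidence for a split."""
    return any(evidence.values())


# Hypothetical pair of taxa early in speciation: only mtDNA has diverged.
evidence = {"mtDNA": True, "plumage": False, "song": False}

split_by_congruence = integration_by_congruence(evidence)
split_by_cumulation = integration_by_cumulation(evidence)
print(split_by_congruence, split_by_cumulation)
```

The same evidence yields one species under congruence and two under cumulation, which mirrors the respective risks of underestimating and overestimating species numbers.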

The occurrence of concordances and discordances between species criteria is 
predicted by the general lineage concept (De Queiroz 1998, 1999), because the 
continuous nature of speciation results in the occurrence of certain criteria at 
particular phases during the speciation process. Discordance between species criteria 
is to be expected in the early stages of speciation. Furthermore, which characters 
diverge first is contingent on the evolutionary history of the taxa under investigation. 
For example, in some cases, genetic divergence might precede morphological 
differentiation, while in other cases, it might be the other way around. As speciation 
proceeds, more characters may diverge, resulting in concordance among different 
species criteria. 


3 Enter Genomics 

Genomics has become a standard practice in ornithology (Kraus and Wink 2015; 
Oyler-McCance et al. 2016; Toews et al. 2016; Wink 2019). With genomics I refer to 
the next-generation sequencing techniques discussed in Toews et al. (2016), namely, 
whole genome sequencing and resequencing, reduced representation techniques 
[genotyping-by-sequencing (GBS) and restriction-site-associated DNA sequencing
(RADseq)], sequence capture and RNA sequencing. The impact of these novel 
sequencing techniques on avian species delimitation has only started to be addressed
in the literature. 

Several microbiologists have proposed a system of species delimitation solely 
based on genomic data (Konstantinidis et al. 2006; Meier-Kolthoff et al. 2013). 
Jarvis (2016) extended this line of thinking to vertebrates and argued that ‘with the 
availability of extant genomes of all “species” of a vertebrate class, one could define 
or redefine species based on genomic distances’. This approach is not feasible for 
several reasons. First, what threshold should be applied? The decision for a certain 
threshold of minimum divergence seems arbitrary. Furthermore, comparable to the 
>2% divergence rule for mitochondrial DNA barcodes (Johns and Avise 1998; 
Hebert et al. 2004), it might not be applicable to all bird taxa (Lovette 2004; Will 
et al. 2005). Second, how will the genomic distance be calculated? An average over 
the whole genome ignores heterogeneity in divergence across the genome (Ellegren 
et al. 2012; Poelstra et al. 2014; Dhami et al. 2016; Wolf and Ellegren 2017). Indeed, 
divergence is often concentrated in certain ‘genomic islands’ (Michel et al. 2010; 
Cruickshank and Hahn 2014). 
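
The second objection can be made concrete with a small numerical sketch. The Python below is an illustration only, not a method from the studies cited above: it computes the uncorrected proportion of differing sites (p-distance) between two aligned sequences, once over the whole alignment and once in fixed windows. The function names, the toy sequences and the reuse of the 2% barcode cut-off as a genomic threshold are assumptions made purely for the example.

```python
THRESHOLD = 0.02  # assumption: the 2% mtDNA barcode rule, borrowed for illustration


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ; gaps and ambiguous bases are skipped."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differing += a != b
    return differing / compared if compared else float("nan")


def windowed_p_distance(seq_a, seq_b, window):
    """p-distance in consecutive, non-overlapping windows of the alignment."""
    length = min(len(seq_a), len(seq_b))
    return [p_distance(seq_a[i:i + window], seq_b[i:i + window])
            for i in range(0, length, window)]


# Hypothetical 200-bp alignment: three differences, all inside the third window.
taxon_a = "A" * 200
taxon_b = "A" * 100 + "CCC" + "A" * 97

genome_wide = p_distance(taxon_a, taxon_b)
per_window = windowed_p_distance(taxon_a, taxon_b, window=50)

print(f"whole alignment: {genome_wide:.3f} (above threshold: {genome_wide > THRESHOLD})")
for index, value in enumerate(per_window):
    print(f"window {index}: {value:.3f} (above threshold: {value > THRESHOLD})")
```

In this toy alignment the whole-alignment distance falls below the cut-off while one window lies well above it, so the verdict depends entirely on how the distance is summarized.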

In the remainder of this chapter, I will argue that there is no need for a genomic 
species concept in birds; rather, genomic data can be incorporated in the three classes 
of species criteria described by Sangster (2014), namely distinguishability, phylogeny
and cohesion (and particularly the reproductive isolation criterion).


3.1 Genomics and Distinguishability 

Distinguishability encompasses two species criteria: diagnosability and degree of 
difference. The former criterion emphasizes that a unique, fixed character or a unique 
combination of character states can be used to delineate species, while the latter 
criterion emphasizes that this character or combination of character states should be 
sufficiently different from other related taxa. In both criteria, the main point is the 
observation of certain diagnostic features, which can be morphological, biometric, 
behavioural or genetic characters. 

Genetic markers, such as mtDNA, microsatellites or AFLPs (amplified fragment
length polymorphisms), have often been used to discriminate between species
(Hebert et al. 2004; Sites and Marshall 2004). Genome-wide datasets add more
power and sensitivity to detect subtle genetic differentiation. For instance, using
RADseq data, Ruegg et al. (2014a, b) were able to distinguish between eastern and
western populations of the Wilson’s warbler (Cardellina pusilla) more reliably than
previous studies based on mtDNA (Kimura et al. 2002; Paxton et al. 2013)
and AFLPs (Irwin et al. 2011).

As discussed above, discordance between species properties is to be expected in 
the early stages of speciation. Genomic data, too, might not be in line with other
datasets. For example, based on thousands of SNPs, Oyler-McCance et al. (2015)
found genetic differentiation between the main North American population of
greater sage-grouse (Centrocercus urophasianus) and a parapatric population in
California and Nevada (referred to as the ‘bistate’ population). However,
morphological and behavioural studies revealed little or no difference between
these populations (Taylor and Young 2006; Schroeder 2008). An opposite pattern 
was uncovered in redpoll finches (genus Acanthis) where several putative species 
differed phenotypically despite largely undifferentiated genomes (Mason and Taylor 
2015). 

However, as speciation proceeds and more characters diverge, concordance 
among different datasets might arise. For example, a genomic analysis of two 
subspecies of the willet (Tringa semipalmata) indicated that both subspecies are
genetically distinct, providing an extra line of evidence, next to ecological, 
behavioural and morphological differences, to treat these subspecies as separate 
species (Oswald et al. 2016). Similarly, Gohli et al. (2015) argue that several 
subspecies in the Afrocanarian blue tit (Cyanistes teneriffae) complex should be 
treated as distinct species because of concordance among different characters, such 
as genomics, morphology, song and sperm morphology. 

These examples show that genomic data are not a defining species criterion but can
provide an extra diagnostic feature in the description and delimitation of species.
Compared to ‘old-school’ genetic markers (e.g. mtDNA or microsatellites), genomic
data offer substantially more statistical power for inferences about species
ranks.


3.2 Genomics and Reproductive Isolation 

One of the most commonly applied species criteria in avian taxonomy is reproductive
isolation. This criterion is actually closely related to the most frequently used
criterion of diagnosability because reproductive isolation is often necessary for the 
fixation and maintenance of diagnostic differences in sympatric and parapatric taxa 
(Sangster 2014). 

Reproductive isolation is mostly caused by the combination of several pre- and 
postzygotic isolation mechanisms. Because these mechanisms often interact, it may 
be difficult to determine the relative importance of each mechanism. Furthermore, 
the present importance of a mechanism might be different from its historical 
importance (Coyne and Orr 2004; Lackey and Boughman 2017). The interplay of 
different reproductive isolation mechanisms can be depicted as a continuum from a 
panmictic population to two irreversibly isolated species (Seehausen et al. 2014). 
Speciation can be driven by divergent sexual or ecological selection, in which case 
extrinsic postzygotic and prezygotic mechanisms act first and intrinsic postzygotic 
mechanisms come into play later in the speciation process (Fig. 2a). Alternatively, 
speciation can be driven by intrinsic postzygotic mechanisms, such as genetic 
incompatibilities. Extrinsic postzygotic and prezygotic mechanisms then accumulate
and reinforce reproductive isolation at a later stage (Fig. 2b). Hendry et al. (2009)
recognize four stages across the speciation continuum: (1) continuous variation 
without reproductive isolation; (2) discontinuous variation with minor reproductive 
isolation; (3) strong, but reversible, reproductive isolation; and (4) strong and 
irreversible reproductive isolation. It is important to keep in mind that movement 
along the speciation continuum is not constant; speciation can go back and forth at
different speeds or come to a halt at certain stages (e.g. formation of a stable hybrid
zone).

Fig. 2 The speciation continuum of speciation processes driven by (a) divergent selection, where
prezygotic and extrinsic postzygotic barriers evolve before intrinsic postzygotic barriers, and (b) by
intrinsic barriers, where intrinsic postzygotic barriers evolve before prezygotic and extrinsic
postzygotic barriers. In both panels, the horizontal axis runs from panmictic populations to two
irreversibly isolated species. The shapes of the curves are hypothetical. Adapted from Seehausen et al.
(2014)

In line with the general lineage concept, the speciation continuum emphasizes the 
continuous and contingent nature of speciation. By studying how populations move 
across the speciation continuum (i.e. by reconstructing their evolutionary history), one
gains more insights into the speciation process, and one can make more informed 
decisions in the delimitation of particular species. In other words, one needs to study 
the process of speciation. 

The term speciation was first used by Cook (1906) to describe ‘the origination or 
multiplication of species by subdivision, usually, if not always, as a result of 
environmental incidents’. Traditionally, speciation models have been classified in 
a spatial context, namely, the well-known Mayrian triumvirate consisting of allopatry,
parapatry and sympatry (Bush 1975; Butlin et al. 2008). In allopatric speciation,
the geographic range of a species is split into two or more isolated populations
that diverge by natural selection and/or genetic drift. Parapatric speciation concerns
the evolution of reproductive isolation between geographically overlapping
populations that still exchange genes to a limited extent. Sympatric speciation refers
to the situation in which new species originate from a single ancestral population
while inhabiting the same geographical region. From a population genetic perspective,
allopatric and sympatric speciation are the ends of a continuum of gene
flow (m), with parapatric speciation in between (Fig. 3; Gavrilets 2004). From this gene
flow perspective, the latter two geographic speciation models (parapatry and sympatry) 
are combined under the heading divergence-with-gene-flow (Fitzpatrick et al. 2008; 
Pinho and Hey 2010). 

The geographical classification of speciation models has been useful and is still 
widely applied today (Harrison 2012). In addition, some more refined 
geographically inspired speciation models emerged, such as peripatric (Mayr 1982),
stasipatric (White 1969), centrifugal (Brown 1957), microallopatric (Smith 1965;
Paulay 1985) or allosympatric speciation (Coyne and Orr 2004; Mallet 2005).
However, this geographical classification does have its limitations (Butlin et al.
2008), and other ways to classify speciation models have been proposed (Templeton
1981; Kirkpatrick and Ravigné 2002; Gavrilets 2004).

Fig. 3 Speciation models on the basis of geography and gene flow. Each circle represents the
geographical range of a species (in red or yellow). The orange colour indicates that two species
overlap in their ranges. Gene flow is expressed as the migration rate (m) of alleles from one
population to the other and varies between 0 and 0.5. The parapatric and sympatric speciation
models are usually combined under the heading divergence-with-gene-flow

New ways of classifying speciation may result in a proliferation of speciation
models, similar to the situation with species concepts. Indeed, Kirkpatrick and
Ravigné (2002) noted that ‘theoreticians have balkanized the subject of speciation’,
because each mathematical model focuses on a highly specific scenario. A promising 
attempt at an overarching ‘process-based’ classification has been made by 
Dieckmann et al. (2004). They envision speciation as a route through a three-dimensional
cube (which I will call the ‘speciation cube’), of which the axes
represent spatial, ecological and mating differentiation (Fig. 4). At the onset of
speciation, there is no differentiation between the populations, which corresponds
to the starting point at the origin (i.e. lower left corner). In the classic allopatric
model, an external cause (represented by a dotted line) leads to spatial differentiation,
and the populations consequently diverge under genetic drift (dashed line) or
selection (solid line), resulting in mating and/or ecological differentiation.

Sympatric speciation scenarios can also be depicted in this ‘speciation cube’. 
Because no external causes lead to spatial differentiation, the lines are restricted to 
the front plane of the cube. Divergence can be driven by sexual selection (leading to 
assortative mating) or ecological differentiation (Bolnick and Fitzpatrick 2007). A 
textbook example of differentiation by sexual selection has been documented in 
Lake Victoria (East Africa), where several sympatric populations of cichlid fish 
show divergence in male colouration and female preferences (Seehausen et al. 
2008). Differentiation by ecology-based divergent selection is commonly referred 
to as ‘ecological speciation’ (Rundle and Nosil 2005; Nosil 2012). Several classical 
examples of ecological speciation involve sympatric phytophagous insect species 
using different host plants (Berlocher and Feder 2002), such as the apple maggot 
(Rhagoletis pomonella), which specializes on hawthorns and apples (Feder et al.
1988, 2003).

Fig. 4 A process-based representation of speciation as ‘speciation cubes’. In the cube the axes
represent ecological (x), mating (y) and spatial (z) differentiation. Divergence between two
populations can be driven by external processes (dotted line), genetic drift (dashed line) or selection
(solid line). These speciation cubes can be used to visualize different speciation models (see text for
further explanation and examples). Adapted from Dieckmann et al. (2004)

Sympatric speciation in birds is rare and controversial (Phillimore et al. 2008; 
Price 2008). One putative case concerns African indigobirds, which are host-specific 
brood parasites targeting several species, such as red-billed firefinch (Lagonosticta
senegala) and African firefinch (L. rubricata). Males mimic the song of the host,
while females choose their mates based on song and the nests they parasitize. This
situation can lead to immediate reproductive isolation when a host shift occurs,
culminating in a sympatric speciation event (Sorenson et al. 2003). Given that
female indigobirds choose their partner based on host song, this particular case
can be considered sympatric speciation by sexual selection. To my knowledge,
African indigobirds represent the only well-documented avian example of such a
scenario. Several macroevolutionary studies found no support for sexual selection
directly driving (sympatric) speciation (Morrow et al. 2003; Huang and Rabosky 
2014; Cooney et al. 2017). In general, sympatric speciation in birds seems to rely on 
additional factors, such as ecological differentiation. 

Differentiation by ecology-based divergent selection (ecological speciation)
has been proposed for several bird taxa (e.g. Caro et al. 2013; Ryan et al. 2007;
Smith et al. 2011; Zhen et al. 2017). For example, populations of the little greenbul 
(Andropadus virens) are distributed across an ecological gradient between rainforest 
and savannah in sub-Saharan Africa. Along this gradient, or ecotone, selection 
pressures are expected to vary due to differences in environmental factors, such as 
rainfall and tree cover. The resulting divergent natural selection might lead to 
phenotypic differentiation and ultimately speciation. Several studies have shown 
that this is indeed the case for the little greenbul (Smith et al. 1997, 2005; Kirschel 
et al. 2011; Zhen et al. 2017). Ecotones might provide the ideal conditions for 
ecological speciation, as exemplified by other bird studies along similar gradients 
(Ribeiro et al. 2011; Smith et al. 2011; Caro et al. 2013). 

Ecological speciation can be facilitated by so-called magic traits, which refer to 
the situation where traits under divergent natural selection and traits involved in mate 
choice are the same or share a similar genetic background (Servedio et al. 2011). In 
birds, beak morphology is an obvious ‘magic trait’ candidate: divergent selection 
can lead to distinct beak morphologies. Different beaks produce different sounds, 
providing the opportunity for assortative mating. This scenario has been described 
for North American red crossbills (Loxia curvirostra). This species complex can be
divided into nine call types based on differences in vocalizations, bill size and palate 
structure (Benkman 1993, 1999). These differences are the result of divergent 
selection for foraging on different conifer species. Hence, each call type inhabits a 
distinct ecological niche (Benkman 2003). Behavioural and genetic studies indicated 
that this ecological specialization contributes to reproductive isolation between 
several call types (Parchman et al. 2006; Smith and Benkman 2007; Smith et al.
2012). Other examples include subspecies of the swamp sparrow (Melospiza
georgiana; Ballentine et al. 2013) and populations of medium ground finch
(Geospiza fortis) on Santa Cruz Island in the Galapagos archipelago (Podos et al.
2013).

A special case of speciation in sympatry is allochronic speciation, in which 
populations diverge because they breed at different times of the year (Taylor and
Friesen 2017). This scenario has been described for several members of the subfamily
Hydrobatinae (storm petrels), such as Madeiran storm petrels (Hydrobates
castro) on the Azores (Monteiro and Furness 1998; Friesen et al. 2007) or Ainley’s
storm petrel (H. cheimomnestes) and Townsend’s storm petrel (H. socorroensis) on
Guadalupe (Taylor et al. 2018). Other species pairs in this subfamily have probably 
diverged in allochrony after an initial allopatric stage (Wallace et al. 2017). 

Apart from allopatric and sympatric speciation scenarios, ‘speciation cubes’ can 
also be used to depict more complex, often multiphase, speciation processes. For 
example, Fig. 4 shows a scenario in which two populations are first geographically 
isolated and develop partial ecological and mating differentiation in allopatry. Later 
they re-establish contact and further ecological and mating differentiation occurs. 
Such complex speciation scenarios seem to be common in birds: numerous cases of 
secondary contact after a period of geographical isolation have been documented 
(Price 2008; Ottenburghs et al. 2017). A classical example of such a secondary 
contact is the hybrid zone between the carrion crow (C. corone corone) and
hooded crow (C. c. cornix) across central Europe (Meise 1928). 

The process-based approach by Dieckmann et al. (2004) complements the life 
history approach to the species problem, discussed above (Harrison 1998). The 
‘speciation cube’ allows for the depiction of many different speciation scenarios, 
each representing the specific life history of a particular species pair. In all cases, 
however, there is the build-up of reproductive isolation between two or more 
populations. Genomic data can be used to provide a genetic basis for these reproductive
isolation mechanisms (Wu and Ting 2004; Presgraves 2010). In birds,
reproductive isolation is generally driven by prezygotic isolation mechanisms.
Intrinsic postzygotic isolation mechanisms, such as hybrid sterility or inviability,
evolve at a slower rate (Fitzpatrick 2004). In mammals, male hybrid sterility has 
been linked to PRDM9, a rapidly evolving zinc finger protein that probably plays a 
role in modifying recombination hotspots (Payseur 2016). Interestingly, birds lack 
PRDM9, which might explain the slow evolution of intrinsic postzygotic isolation 
(Singhal et al. 2015; Vignal and Eory 2019). 

One of the newest genomic tools in speciation research is the genome scan (Haasl
and Payseur 2016), which calculates population genetic statistics, such as F_ST or nucleotide
diversity (π), in windows across the genome. Analyses of recently diverged species
have uncovered highly heterogeneous patterns of genetic divergence across the 
genome (Harr 2006; Michel et al. 2010; Ellegren et al. 2012; Nadeau et al. 2012; 
Ravinet et al. 2017; Wolf and Ellegren 2017): some genomic regions show little 
genetic differentiation, while other regions show high levels of differentiation and 
may even harbour fixed differences that could be used to distinguish between 
species. 
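
A windowed scan of this kind can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the pipeline of any study cited above: it uses the ratio-of-sums form of Hudson's F_ST estimator (chosen here for illustration), takes allele frequencies and sample sizes as given, and runs on invented toy values. The function names `hudson_terms` and `windowed_fst` and the tuple layout of the input are hypothetical.

```python
from collections import defaultdict


def hudson_terms(p1, n1, p2, n2):
    """Numerator and denominator of Hudson's F_ST for one biallelic SNP.

    p1, p2: frequency of the same allele in populations 1 and 2.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    numerator = ((p1 - p2) ** 2
                 - p1 * (1 - p1) / (n1 - 1)
                 - p2 * (1 - p2) / (n2 - 1))
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator, denominator


def windowed_fst(snps, window_size):
    """Return {window_start: F_ST} for non-overlapping windows.

    snps: iterable of (position, p1, n1, p2, n2) tuples on one chromosome.
    Per-window F_ST is the sum of numerators divided by the sum of denominators.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for position, p1, n1, p2, n2 in snps:
        start = (position // window_size) * window_size
        numerator, denominator = hudson_terms(p1, n1, p2, n2)
        sums[start][0] += numerator
        sums[start][1] += denominator
    return {start: (num / den if den > 0 else float("nan"))
            for start, (num, den) in sorted(sums.items())}


# Hypothetical allele frequencies: a weakly differentiated background window
# and a second window containing two fixed differences.
toy_snps = [
    (1_200, 0.50, 20, 0.50, 20),
    (4_800, 0.30, 20, 0.35, 20),
    (12_500, 1.00, 20, 0.00, 20),
    (17_900, 0.00, 20, 1.00, 20),
]
scan = windowed_fst(toy_snps, window_size=10_000)
for start, fst in scan.items():
    print(f"window starting at {start}: F_ST = {fst:.3f}")
```

The first toy window yields an estimate near zero (slightly negative values are expected from the sample-size correction), whereas the window with fixed differences reaches an F_ST of 1, mirroring the contrast between background regions and highly differentiated regions described above.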

The observation of such heterogeneous genomic divergence is often linked to the 
speciation model of divergence-with-gene-flow, during which genetic differentiation 
accumulates in some genomic regions, while the rest of the genome (which is 
assumed to be largely neutral) is homogenized by gene flow (Barton and Bengtsson 
1986; Nosil et al. 2009). Assuming that genes in these differentiated regions may be 
involved in reproductive isolation, these regions have been referred to as ‘genomic 
islands of speciation’ (Turner et al. 2005). 

However, heterogeneous patterns of genetic differentiation can also be explained 
by alternative models in which there is no gene flow between recently diverged 
species (Noor and Bennett 2009; Cruickshank and Hahn 2014; Ravinet et al. 2017; 
Wolf and Ellegren 2017). In these models, reproductive isolation is instantaneous 
and complete (e.g. allopatric speciation). Species pairs show low levels of differentiation
because they have recently split. Shared alleles are due to ancestrally
inherited variation, not introgression. Heterogeneity in the level of differentiation
across the genome can be explained by stochastic variation in coalescent times and
natural selection. Loci experiencing strong selection will share less ancestral variation
and will appear to be more differentiated. Hence, genomic regions under
selection may be directly involved in reproductive isolation or they may be unrelated
to any trait associated with species divergence and are simply influenced by ‘background’
selection. In contrast to the previous model, there is no differential gene flow
across loci, and the regions of highest differentiation do not necessarily house genes
involved in reproductive isolation (Turner and Hahn 2010). 

When studying genomic differentiation between species, one must consider that 
species divergence does not only involve changes in DNA sequences but also 
rearrangements of genome architecture. Chromosomal rearrangements, such as 
translocations and inversions, often comprise fixed differences between closely 
related species (Coyne and Orr 2004). Once fixed between populations, inversions 
could enhance divergence between populations and eventually lead to speciation 
(Noor et al. 2001; Rieseberg 2001). The role of chromosomal inversions in avian 
speciation has been given less attention compared to sequence divergence (Price 
2008). Chromosomal inversions do have a long history in cytological studies 
(Shields 1982; Hooper and Price 2015), which have recently been complemented 
with studies that used genomic data to map the breakpoints of these inversions 
(Volker et al. 2010; Kawakami et al. 2014). For example, Kawakami et al. (2014) 
compared the genomes of collared flycatcher (Ficedula albicollis) and zebra finch 
(Taeniopygia guttata) and found evidence for 140 inversions, resulting in an 
estimated mean fixation rate of about two inversions per million years. This suggests 
that inversions can evolve fast enough to play an important role in speciation, but the 
relation between inversions and reproductive isolation remains unclear. 

The genomic perspective on speciation has led to many important insights into 
the origin and maintenance of species, but these insights can also be applied in 
species delimitation. Indeed, genomic islands or inversions may be used as diagnostic
characters. For example, Poelstra et al. (2014) explored patterns of introgression
across the European hybrid zone between the carrion crow and hooded crow. A
small number of genomic regions displayed resistance against introgression, including
one prominent inversion containing genes involved in pigmentation and visual
perception, reflecting ‘colour-mediated prezygotic isolation’. It is, however, challenging
to confidently show that certain genomic islands or inversions house genes
that are related to reproductive isolation and are not just the result of background 
selection (Ruegg et al. 2014a, b; Delmore et al. 2015; Bay and Ruegg 2017). 


3.3 Genomics and Phylogeny

The final class of species criteria listed by Sangster (2014) concerns phylogeny. The 
two species criteria in this class (monophyly and exclusive coalescence of gene 
trees) are not called upon often in avian species delimitation. The application of 
genomic data might even make these criteria obsolete. The advent of multilocus data 
showed that the occurrence of phylogenetic incongruence (i.e. analyses of different 
genes resulting in discordant gene trees) is a common and widespread phenomenon 
(Rokas et al. 2003; Ottenburghs et al. 2016b; Braun et al. 2019). Such incongruence 
can be caused by analytical shortcomings (Rokas et al. 2003; Davalos et al. 2012) or 
can be the result of biological processes, such as horizontal gene transfer, 
hybridization, incomplete lineage sorting, and gene duplication (Pamilo and Nei
1988; Maddison 1997; Degnan and Rosenberg 2009). 

Incomplete lineage sorting (ILS) occurs when lineages fail to coalesce in the 
ancestral population of two species (Pamilo and Nei 1988; Maddison 1997; Degnan 
and Salter 2005). The probability of ILS depends on effective population size of the 
ancestral population and the time between two successive speciation events (Degnan 
and Salter 2005; Wakeley 2009). In chimpanzees, gorillas and humans—clearly 
separate species—about 30% of the loci in the genome support a gene tree that 
deviates from the species tree (Scally et al. 2012). Under a strict interpretation of the 
genealogical species concept, which states that ‘a group of organisms [that] is
exclusive if their genes coalesce more recently within the group than between any 
member of the group’, chimpanzees, gorillas and humans would constitute one 
species. 
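
The dependence of ILS on ancestral population size and on the time between speciation events has a simple closed form for three taxa under the standard neutral coalescent: the probability that a gene tree differs from the species tree is (2/3) exp(-T), where T = t / (2 N_e) is the length of the internal branch in coalescent units. The sketch below evaluates this expression. The function names and the example values for t and N_e are hypothetical, and the model assumes a diploid, panmictic ancestral population and no introgression.

```python
import math


def discordance_probability(t_generations, ne):
    """P(gene tree differs from species tree) for three taxa.

    t_generations: time between the two speciation events, in generations.
    ne: effective size of the ancestral (diploid) population on that branch.
    """
    branch = t_generations / (2 * ne)  # internal branch in coalescent units
    return (2 / 3) * math.exp(-branch)


def internal_branch_for(discordance):
    """Internal branch length (coalescent units) implied by a discordance level."""
    if not 0 < discordance <= 2 / 3:
        raise ValueError("discordance must lie in (0, 2/3] under this model")
    return -math.log(1.5 * discordance)


# Hypothetical example: 100,000 generations between splits, ancestral N_e of 50,000.
example = discordance_probability(t_generations=100_000, ne=50_000)
print(f"expected discordance: {example:.3f}")

# Branch length implied by the roughly 30% discordance quoted in the text,
# assuming (as a simplification) that ILS is the only source of discordance.
implied = internal_branch_for(0.30)
print(f"implied internal branch: {implied:.2f} coalescent units")
```

Under these assumptions, 30% discordance corresponds to an internal branch of roughly 0.8 coalescent units, which shows that substantial gene tree discordance is compatible with well-separated species whenever the ancestral population was large relative to the time between splits.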

Apart from ILS, introgressive hybridization can cause gene tree discordance. 
Hybridization is a widespread phenomenon in birds: about 16% of bird species
have hybridized with at least one other species (Ottenburghs et al. 2015). When
hybridization occurs, the genetic material from one species might introgress into 
the genome of the other species (Anderson 1949; Arnold 2006). This genetic 
exchange has the potential to hamper phylogenetic analyses in a number of ways 
(Rheindt and Edwards 2011; Ottenburghs et al. 2017). First, a hybridization event 
can shorten branch lengths, thereby disrupting or resetting the molecular clock. 
Second, introgression can alter tree topologies. When non-sister species hybridized 
in the past, an introgressed gene might yield a topology that depicts a faulty sister 
relation between these species. This has been observed in several bird genera, such 
as Manacus (Brumfield et al. 2008) and Icterus (Jacobsen and Omland 2012). In 
addition, if the introgression event happened far back in time, it can lead to 
misleading branching arrangements at deep phylogenetic nodes. For instance, 
Fuchs et al. (2013) attribute a conflicting pattern between several loci to ancient 
hybridization between members of the woodpecker genus Campephilus and the 
melanerpine lineage (Melanerpes and Sphyrapicus). 

Genomic analyses of several bird groups show that gene tree discordance— 
caused by incomplete lineage sorting and/or introgression—is common in birds 
(Kulikova et al. 2005; Rheindt et al. 2009; Kraus et al. 2012; Jarvis et al. 2014; 
Nater et al. 2015; Suh et al. 2015; Ottenburghs et al. 2016a, b; Sonsthagen et al. 
2016). Therefore, it seems that the criterion of ‘exclusive coalescence of gene trees’ 
becomes problematic in the genomic era of avian taxonomy. In a genome-scale 
collection of gene trees, there will always be a subset of trees that show exclusive 
coalescence. Are these trees taxonomically more informative than discordant gene 
trees? Putting more weight on a subset of monophyletic gene trees can result in 
cherry-picking to support a particular taxonomic arrangement. Hence, it is advisable 
to report all gene trees and consider genomic evidence in combination with other 
data, such as morphometrics, song or plumage.
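
Reporting all gene trees can be as simple as tallying how often each topology occurs. The sketch below does this for the smallest informative case, rooted three-taxon trees in Newick format, where the topology is fully described by which two tips are sisters. It is an illustrative assumption-laden example: the function names, the regular expression and the toy counts are hypothetical, and larger trees would need a proper tree-comparison library.

```python
import re
from collections import Counter

# Matches an innermost "(leaf,leaf)" group, i.e. a pair of sister tips.
_SISTER_PAIR = re.compile(r"\(([^(),;]+),([^(),;]+)\)")


def sister_pair(newick):
    """Return the two sister taxa of a rooted three-taxon Newick tree."""
    match = _SISTER_PAIR.search(newick)
    if match is None:
        raise ValueError(f"no resolved sister pair in {newick!r}")
    # Drop optional branch lengths such as "A:0.1".
    return frozenset(label.split(":")[0].strip() for label in match.groups())


def topology_frequencies(newicks):
    """Share of gene trees supporting each sister pair."""
    counts = Counter(sister_pair(tree) for tree in newicks)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical gene trees for taxa A, B and C (counts invented for illustration).
gene_trees = (["((A,B),C);"] * 7
              + ["(A,(B,C));"] * 2
              + ["((A:0.1,C:0.2):0.05,B:0.3);"] * 1)
frequencies = topology_frequencies(gene_trees)
for pair, share in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(sorted(pair), f"{share:.0%}")
```

Presenting the full distribution, rather than only the majority topology, makes it visible how much of the genome disagrees with a proposed taxonomic arrangement.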


4 Conclusions

The species problem has been partly resolved by combining theoretical monism (the 
evolutionary species concept or the general lineage concept) and practical pluralism 
(Mayden 1997; De Queiroz 1998). The general lineage concept shows that there is a 
grey zone during the speciation process where different species criteria lead to 
different decisions on the number of species. By acknowledging the continuous 
nature of speciation and realizing that these species criteria are not necessary 
properties of species, taxonomists can make better informed decisions in species 
delimitation. 

The practical pluralism is also apparent in avian taxonomy. An analysis by 
Sangster (2014) showed that almost half of the taxonomic bird studies applied 
multiple criteria and the proportion of studies applying multiple criteria is increasing. 
The most frequently used species criterion is diagnosability, followed by reproductive
isolation and degree of difference. Other species criteria concern the use of
different adaptive zones, monophyly and exclusive coalescence of gene trees. 
Combining these—often conflicting—species criteria calls for integrative taxonomy, 
which can either proceed by congruence or by cumulation of species criteria 
(Padial et al. 2010). 

The advent of genomic data has prompted some authors to propose a ‘genomic 
species concept’ (e.g. Jarvis 2016). The analysis of the species problem presented 
here shows that such a genomic species concept will not resolve conflicts in species 
delimitation. Instead, genomic data should be incorporated into the existing species 
criteria. With regard to distinguishability (i.e. the use of a unique, fixed character or a 
unique combination of character states to delineate species), for example, genomic 
data can provide an extra line of evidence. 

Reproductive isolation is mostly caused by the combination of several pre- and 
postzygotic isolation mechanisms. The interplay of different reproductive isolation 
mechanisms can be depicted as a continuum from a panmictic population to two 
irreversibly isolated species (Seehausen et al. 2014). Genomic data can be used to 
provide a genetic basis for these reproductive isolation mechanisms (Wu and Ting 
2004; Presgraves 2010). In addition, genomic islands or inversions, which may play 
an important role in certain speciation processes, can be used as diagnostic 
characters. 

Finally, the application of genomic data might complicate the criterion ‘exclusive 
coalescence of gene trees’. The advent of multilocus data showed that the occurrence 
of phylogenetic incongruence (i.e. analyses of different genes resulting in discordant 
gene trees) is a common and widespread phenomenon (Rokas et al. 2003; 
Ottenburghs et al. 2016b), which can be the result of several biological processes, 
such as horizontal gene transfer, hybridization, incomplete lineage sorting and gene 
duplication (Pamilo and Nei 1988; Maddison 1997; Degnan and Rosenberg 2009). 
Hence, it is advisable to report all gene trees and consider genomic evidence in 
combination with other data sources. 

In conclusion, the genomic era has provided avian taxonomists with a new toolbox
that can be applied to old concepts, leading to better informed decisions on the
delimitation and description of species. 


Abstract

Population genetics is the study of genetic variation within populations and how 
allele frequencies change over space and time. This field largely focuses on the 
five fundamental evolutionary processes that influence genetic variation: mutation,
genetic drift, gene flow, natural selection, and recombination. In this chapter,
we review how genomic data from avian species have advanced our understanding
of each of these five processes, including an emphasis on their interactions in
shaping contemporary genetic diversity on the scale of whole populations. In 
general, genomic data has increased the potential for fine-scale resolution of 
population structure and determination of population boundaries and population 
membership. However, delineating populations is not always straightforward, 
and populations tend to fall on a continuum from isolation to panmixia. Mutation 
is the ultimate source of all genetic variation within populations. The ability to 
sequence whole genomes resulted in better estimates of mutation and substitution 
rates in particular genomic regions (e.g., coding vs. noncoding DNA) and along 
different avian lineages. The uncovered variation in these rates will further
advance our knowledge of bird evolution. A genomic perspective on other
evolutionary forces, such as genetic drift (tightly linked with the concept of
effective population size [N_e]), migration, and selection, allows for more detailed
reconstructions of demographic and phylogeographic history. In addition, the
estimates of genome-wide recombination rates and their relationship with linked
selection and GC-biased gene conversion will improve the match between population
genetic models and biological reality.


Keywords 

Assortative mating • Demography • Effective population size • GC-biased gene 
conversion • Gene flow • Linked selection • Natural selection • RADseq • 
Recombination • Substitution rates 


1 Introduction 

The field of population genomics, defined as the “process of simultaneous sampling
of numerous variable loci within a genome and the inference of locus-specific effects
from the sample distributions,” was first conceptualized by Black IV et al. (2001).
This initial conceptualization emphasized distinguishing factors that influence
unlinked loci independently (locus-specific effects), such as mutation, recombination,
nonrandom mating, and selection, from those factors that have a similar
influence on loci throughout the genome (genome-wide effects), such as genetic
drift, gene flow, and inbreeding. Rather than emphasizing locus-specific effects,
Luikart et al. (2003) defined population genomics more broadly as “the simultaneous
study of numerous loci or genome regions to better understand the roles of evolutionary
processes [...] that influence variation across genomes and populations”
(p. 981). In contrast to Black IV et al. (2001), Luikart et al. (2003) concluded that the
most important contribution of genomic sampling is to provide better inferences of
population demography and evolutionary history. Hartl and Clark (2007) similarly
adhered to a broader definition, “the application of population genetics on a genomic
scale” (p. 469). In this review, we use this more general definition of population
genomics and examine the fundamental evolutionary processes that influence
genetic variation: mutation, genetic drift, migration (i.e., gene flow), natural selection,
and recombination (Sects. 3-7).

Genetic diversity within populations is the result of these five fundamental
evolutionary forces. For the most basic model, equilibrium values of genetic diversity
are a function of mutation and genetic drift, both of which are a function of
population size ($N$). Because more mutations occur and genetic drift is less efficient
at removing variation in larger populations, genetic diversity should be directly
proportional to $N$ (Wright 1931), a relationship that has been supported by empirical
studies (Soule 1976; Frankham 1996, 2012). However, this simple model makes a
number of assumptions, including no immigration, no selection, constant $N$ (i.e.,
drift-mutation equilibrium), nonoverlapping generations, and random mating
(Wright 1931, 1938; Frankham 1995). In these mathematical models, $N$ is not
what ecologists would count when they go out in the field and ask “how many
individuals are there?” The latter question refers to the census population size,
usually specified as $N_c$. In population genetics, we typically calculate the effective
population size $N_e$, which is a rather abstract quantity that reflects the genetic
diversity of a population under study but includes the effects of inbreeding and
subdivision, among others (Hartl and Clark 2007). Typically, $N_e$ is much smaller
than $N_c$ (for details see Sect. 4 in this chapter). The effectiveness of selection is also
dependent on $N_e$ (Ohta 1972, 1992; Gillespie 2001; Ellegren 2009). Specifically, if
the product $2N_e s$ (where $s$ is the selection coefficient) $\gg 1.0$, selection will override
drift in determining the fate of mutations, whereas if $2N_e s \ll 1.0$, drift will dominate.
The emerging field of population genomics has revealed compelling evidence that
directional selection, balancing selection, purifying selection, and hitchhiking are
pervasive throughout the genome, causing widespread departures from neutral
models (Hahn 2008; McVicker et al. 2009; Charlesworth 2012; Burri 2017a).
However, some genomic features can also be explained by nonadaptive processes,
such as genetic drift (Lynch 2007).
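
As a rough numerical illustration of the $2N_e s$ rule described above, the following Python sketch classifies the expected balance between selection and drift. The function name and the cut-offs standing in for "much greater than" and "much less than" 1.0 (here 10 and 0.1) are assumptions chosen purely for illustration; they are not values taken from the cited literature.

```python
def selection_regime(n_e, s, upper=10.0, lower=0.1):
    """Classify the balance between selection and drift from 2 * Ne * |s|.

    `upper` and `lower` are illustrative stand-ins for ">> 1" and "<< 1".
    """
    scaled = 2.0 * n_e * abs(s)
    if scaled >= upper:
        return "selection dominates"
    if scaled <= lower:
        return "drift dominates"
    return "intermediate"


# The same selection coefficient behaves differently in small and large populations.
for n_e in (100, 10_000, 1_000_000):
    print(n_e, selection_regime(n_e, s=1e-4))
```

The point of the sketch is only that a mutation with a fixed selection coefficient can be effectively neutral in a small population and visible to selection in a large one.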


2 What Is a Population? 

The term population has been defined in a variety of ways throughout the scientific
literature (Waples and Gaggiotti 2006). At one extreme, population is essentially
synonymous with sampling location, referring to a group of individuals sampled
from a single location. Hartl and Clark (2007) defined a population in a more
biologically relevant way: “a group of organisms of the same species living within
a sufficiently restricted geographical area so that any member can potentially mate
with any other member of the opposite sex” (p. 45). In an ideal population of
sexually reproducing individuals, mating is random, and any individual has an
equal probability of mating with any other individual from the same population
(i.e., the population is panmictic). However, it is questionable whether any population
is truly panmictic. Mating is rarely, if ever, completely random; rather,
individuals are more likely to mate with individuals in close proximity. In other
words, the probability of mating decreases with increasing distance between
individuals, and this nonrandom mating results in a spatial organization of genetic
variation (i.e., isolation by distance), even in the absence of any other factors such as
mate choice or mobility. When making inferences about demographic histories or
selection, the distribution of alleles across space becomes critically important.

Most definitions of a population are not operational in the sense that they fail to
provide quantitative criteria for determining which individuals belong to the same or
a different population. Waples and Gaggiotti (2006) suggested using the number of
effective migrants per generation ($N_e m$, where $N_e$ is the effective population size and
$m$ is the migration rate) as an operational criterion for determining whether groups of
individuals could be considered a population. The threshold for this criterion is
somewhat arbitrary, and estimating $N_e m$ can be cumbersome, especially with genomic
datasets. Moreover, calculating $N_e m$ requires some a priori knowledge about
which individuals are grouped. Therefore, the first hurdle in delineating populations
is determining which individuals are sufficiently similar that they can be considered
part of the same population.

In the past, statistical power from only a small number of genetic markers from
distant regions of the genome has often been insufficient to unveil weak population
structure, and increasing the number of markers has clearly shown that more markers
give better signals (Kraus et al. 2015). Population genomics uses technology to
increase the number of genetic markers by orders of magnitude (Kraus and Wink
2015; Wink 2019), thereby offering the potential for fine-scale resolution of population
structure and determination of population boundaries and population membership.
Peters et al. (2016) conceptualized a quantitative framework for using large-scale
genetic datasets to delineate “conservation units.” This framework, largely
inspired by approaches in Harvey and Brumfield (2015), can be applied to
delineating populations. Using the mottled duck (Anas fulvigula) and genotypes
obtained from a reduced representation genomic approach, double-digest
restriction-associated DNA sequencing (ddRAD-seq), Peters et al. (2016) used a variety of
analytical methods to distinguish between apparent panmixia, discrete population
units, and isolation by distance. Specifically, they demonstrated that the Florida and
Western Gulf Coast populations of mottled ducks were discrete units: genotypes
were sufficiently similar within regions and different between regions that (1) all
individuals grouped together in population-specific clusters on the basis of ddRAD-seq
genotypes (Fig. 1a), (2) all individuals were assigned unambiguously to their
populations of origin, (3) the geographic area separating these populations was a
better predictor of allele frequency differences than geographic distance alone, and
(4) there was limited evidence of admixture and gene flow between these
populations. In contrast, there was no evidence of population structuring within
Florida or the Western Gulf Coast. Therefore, in the case of mottled ducks,
delineating population boundaries was unambiguous. Other studies of avian taxa
have used similar approaches with ddRAD-seq data to demonstrate discrete
differences in multilocus genotypes between geographic groups (Parchman et al.
2013; Harvey and Brumfield 2015; Kopuchian et al. 2016), and such discrete
population structure has been used as evidence for species delimitation (Oswald
et al. 2016).

In contrast to the discrete population units found in mottled ducks, studies of
some avian taxa found ambiguous evidence of population boundaries (Kraus et al.
2013; Lavretsky et al. 2015). For example, Lavretsky et al. (2015) used principal
component analyses (PCA) to cluster individuals based on ddRAD-seq genotypes in
mallards (Anas platyrhynchos) and Mexican ducks (A. diazi). Although there was
some evidence of discrete or nearly discrete populations (e.g., eastern and western
populations of mallards and mallards vs. Mexican ducks), individuals could not be
unambiguously assigned to populations and there appeared to be substantial admixture.
Overall, a pattern of isolation by distance seemed to describe the geographic
distribution of alleles (Fig. 1b); for example, western mallards, which are geographically
closer to Mexican ducks, were genetically intermediate between eastern
mallards and Mexican ducks, and Mexican ducks sampled from the United States
were genetically intermediate between mallards and Mexican ducks sampled from
Mexico. Within Mexico, there was a stepping-stone pattern of genetic differentiation:
individuals from the most geographically distant sampling locations were the
most genetically differentiated (e.g., Sonora vs. Puebla), whereas there was substantial
overlap in principal component (PC) scores among individuals from neighboring
sites (e.g., Puebla vs. the state of Mexico). A similar pattern of isolation by distance
was also found among subspecies of dark-eyed junco (Junco hyemalis) in North
America: the principal components showed a striking resemblance to geographic
distribution (Friis et al. 2016). Although the slate-colored junco (J. h. hyemalis) does
appear to be a discrete population, it is important to emphasize that all the individuals
examined were sampled from the same location; more comprehensive sampling
across their range will be necessary to determine whether this subspecies represents
a discrete population or if it fits within a broader pattern of isolation by distance.
Otherwise, for both juncos and Mexican ducks, the challenge is that delineating
population boundaries is not possible given the gradation in multilocus genotypes
over space, despite clear evidence of population structure. Thus, the use of population
genomics to infer aspects of population demography, history, and selection
necessitates models that incorporate isolation by distance.

Fig. 1 Examples of the gradient of possible outcomes when applying genomic data to inferences of
population boundaries, including (a) discrete population units in mottled ducks (Peters et al. 2016),
(b) continuous variation with possible isolation by distance in mallards and Mexican ducks
(Lavretsky et al. 2015), and (c) apparent panmixia in turtle doves (Calderon et al. 2016)

Similar to Mexican ducks and dark-eyed juncos, population genomics suggests 
that red crossbills (Loxia curvirostra) comprise a mix of discrete, nearly discrete, and
non-discrete ecotypes that loosely correspond to geographic populations (Parchman 
et al. 2016). However, in this case, there was no overall pattern of isolation by 
distance, at least partly as a result of their nomadic behavior. For example, PCA 
clusters individuals from the western and eastern parts of the red crossbill’s range to 
the exclusion of individuals from the interior. Parchman et al. (2016) concluded that 
adaptation to conifer species, rather than geography, was a better explanation of the 
observed genetic differentiation. In addition, the population from South Hills, Idaho, 
USA, appeared to be a discrete population that was genetically distinct from other 
crossbills, and these results coupled with differences in morphology and calls have 
resulted in the recognition of a distinct species, the Cassia crossbill (L. sinesciuris) 
(Chesser et al. 2017). 

In some cases, population genomics might fail to reveal population structure,
even for species with broad geographic distributions. For example, Calderon et al.
(2016) sampled European turtle doves (Streptopelia turtur) from locations throughout
eastern and western Europe and obtained genomic data using ddRAD-seq. Using
PCA on single-nucleotide polymorphisms (SNPs; pronounced snips), they found
that PC scores overlapped substantially among individuals from different sampling
locations (Fig. 1c). Thus, despite its widespread distribution, genetic variation within
European turtle doves is consistent with a single, panmictic population. On ecological
timescales, populations from the different regions may or may not be
demographically independent; however, on evolutionary timescales, there is sufficient
genetic connectivity (i.e., gene flow, range expansion) that detectable population
structure does not emerge. Population genomic data likewise failed to reveal population
structure in mountain chickadees (Poecile gambeli), despite geographically
structured phenotypic variation and evidence of local adaptation in life history
(Branch et al. 2017). Thus, for the purpose of population genomics, samples from
different regions could be pooled and analyzed as a single population for inferences
of evolutionary history.

The above case studies illustrate possible outcomes of inferring population
structure using genomic data and multivariate statistics. Waples and Gaggiotti
(2006) provided a visual representation of the continuum of population differentiation,
from isolation to panmixia, and PCA and other similar orthogonal
transformations (e.g., discriminant function analysis) offer the ability to visualize
where species of interest fall within this continuum. For instance, the examples
discussed above illustrate this continuum; mottled ducks (Fig. 1a) clearly fit the
scenario of isolation or “complete independence,” Mexican ducks (Fig. 1b) fit both
“modest connectivity” (Sonora, USA, and inland sampling locations) and “substantial
connectivity” (inland locations: Zacatecas, Guanajuato, Mexico, and Puebla),
whereas European turtle doves (Fig. 1c) best fit “panmixia.” Further advances could
be made by developing methods for quantifying this structure to facilitate
comparisons across taxa from different studies. Such approaches are also applicable
at the opposite end of the continuum, where the distribution of genetic variation
leads into speciation (Ottenburghs 2019).
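
The kind of PCA described in this section can be sketched in a few lines of Python with NumPy. This is a minimal illustration on simulated genotypes, not the pipeline used in any of the cited studies: the sample sizes, the number of SNPs, and the allele frequencies (0.2 vs. 0.8) are all hypothetical and were chosen to produce two clearly discrete units.

```python
import numpy as np

rng = np.random.default_rng(1)

n_per_pop, n_snps = 20, 200
# Hypothetical allele frequencies for two strongly differentiated populations.
freq_a = np.full(n_snps, 0.2)
freq_b = np.full(n_snps, 0.8)

# Genotypes coded as the number of copies (0, 1, or 2) of one allele per SNP.
geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([geno_a, geno_b]).astype(float)

# Center each SNP, then obtain principal components by singular value decomposition.
centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc_scores = u * s                    # rows: individuals, columns: components
explained = s**2 / np.sum(s**2)      # proportion of variance per component

pc1_a, pc1_b = pc_scores[:n_per_pop, 0], pc_scores[n_per_pop:, 0]
overlap = pc1_a.min() <= pc1_b.max() and pc1_b.min() <= pc1_a.max()
print("PC1 variance explained:", round(float(explained[0]), 3))
print("PC1 score ranges overlap:", bool(overlap))
```

Moving the two simulated allele frequencies closer together makes the PC1 score ranges of the two groups start to overlap, which mimics the shift along the continuum from discrete units toward panmixia.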


3 Mutation 

The ultimate source of all genetic variation within populations is mutation, which 
changes the nucleotide sequences within a region of DNA through a point mutation 
(a single base pair change), insertion or deletion of one or more nucleotides, 
inversions, etc. Mutation is independent in different populations. In the absence of 
homoplasy (i.e., recurrent mutations, back mutations to the previous state) and gene 
flow, new mutations that arise after populations split will be unique to a single 
population and cause populations to genetically diverge over time. 

Mutation rates have been estimated across the tree of life, from simple RNA
viruses and bacteria to higher eukaryotes, and vary widely from $7.2 \times 10^{-7}$ to
$7.2 \times 10^{-11}$ per base pair per generation (Drake et al. 1998). In humans, this estimate
translates to a germline mutation rate of about $0.5 \times 10^{-9}$ per base pair per year
(Scally 2016). The number of new mutations that enter a population each generation
is a function of $N_c$. However, many mutations are lethal or strongly deleterious and
are not passed to future generations. Therefore, in population genetics, we consider
the substitution rate, which is the rate at which new mutations accumulate over time.
The substitution rate depends on both the rate at which mutation adds new variants
and the rate at which natural selection removes deleterious or lethal mutations (see
Box 1 in Barrick and Lenski 2013). In the case of strictly neutral evolution, when
new variants do not affect biological fitness, the substitution rate is equal to the
mutation rate. However, with genomic data, the substitution rate is lower than the
mutation rate, and in the absence of mutation accumulation experiments (Barrick and
Lenski 2013) in birds, we can only measure the long-term substitution rates.

Genomic substitution rates vary considerably among lineages of birds. Substitution
rates have been estimated for fourfold degenerate sites in coding regions.
A site is fourfold degenerate when any of the four nucleotides at that site
results in the same amino acid. A substitution at a fourfold degenerate site is also
referred to as a synonymous substitution. The substitution rate at these sites was
estimated to be approximately $3.3 \times 10^{-9}$ substitutions per site per year (s/s/y) for
Passeriformes (perching birds) and $<1.0 \times 10^{-9}$ s/s/y for Struthioniformes
(ostriches) (Zhang et al. 2014). The global rate across all avian lineages was
approximately $1.9 \times 10^{-9}$ s/s/y (Zhang et al. 2014). Similarly, Nam et al. (2010)
found a nearly twofold difference in substitution rates at fourfold degenerate sites
($1.23$-$2.21 \times 10^{-9}$ s/s/y), with the lowest rates in ancestral bird lineages and the
highest rates in a representative of Passeriformes. The substitution rate estimated
from ddRAD-seq, which generates a pseudorandom sampling of the genome and
includes sequences from both coding and noncoding regions, was similar to that
found at fourfold degenerate sites, approximately $1.75 \times 10^{-9}$ s/s/y for a lineage of
Anseriformes (waterfowl) (Peters et al. 2016). However, the substitution rate for
ultraconserved elements (UCEs) and their flanking regions was found to be about an
order of magnitude lower, $2.59 \times 10^{-10}$ s/s/y, in a lineage of Charadriiformes
(shorebirds) (Oswald et al. 2016).
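
To make the notion of a fourfold degenerate site concrete, the sketch below identifies such sites in a coding sequence using the standard nuclear genetic code. The function names and the example sequence are hypothetical, and the code assumes an in-frame sequence containing only the bases A, C, G, and T; it is an illustration of the definition, not the method used in the cited studies.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Standard nuclear genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon):
    """True if every base at the third position yields the same amino acid."""
    prefix = codon.upper()[:2]
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


def fourfold_site_positions(cds):
    """0-based positions of fourfold degenerate third-codon sites in an in-frame CDS."""
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    return [
        i + 2
        for i in range(0, usable, 3)
        if is_fourfold_degenerate(cds[i:i + 3])
    ]


# Codons: ATG (Met), GCT (Ala), AAA (Lys), GGG (Gly), TTA (Leu)
print(fourfold_site_positions("ATGGCTAAAGGGTTA"))  # prints [5, 11]
```

Only the third positions of GCT and GGG are reported, because alanine (GCN) and glycine (GGN) are encoded regardless of the third base, whereas TTA shares its first two bases with phenylalanine codons.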

Substitution rates also vary across the genome. As a general rule, substitutions
accumulate more rapidly in noncoding regions of the genome, such as introns and
intergenic regions, than in protein-coding exons. However, avian genomes contain
an estimated 3.2 million highly conserved elements (HCEs) interspersed throughout
both noncoding and coding DNA (Zhang et al. 2014), and these HCEs contribute to
high variation in substitution rates even within classes of DNA. Similarly, overall
substitution rates also vary among chromosomes. Based on analyses of
transcriptomes for ten species of birds, $d_S$ (divergence at synonymous sites) was
negatively correlated with chromosome size, suggesting that the synonymous
substitution rate is lower for larger chromosomes than for smaller chromosomes
(Kunstner et al. 2010). They also found that $d_S$ was higher for the Z chromosome
than for autosomes, a pattern that was also observed by Zhang et al. (2014) in a
comparative analysis of full genomes from 45 avian species.

In addition to providing information about the rate of evolution, estimates of
substitution rates are necessary to calculate demographic parameters from sequence
data. For example, percent sequence divergence ($d$) can be calculated directly from
genomic data with the formula $d = 2\mu t$, where $\mu$ is the substitution rate and $t$ is the
time since divergence. Thus, having an estimate of $\mu$ (in substitutions per site per
year) allows us to estimate the number of years since two species or populations
began diverging. Similarly, genetic data can provide an estimate of the composite
parameter $\theta$ (theta), where $\theta = 4N_e\mu$, and an estimate of $\mu$ (in substitutions per site
per generation) can therefore be used to estimate effective population sizes. These
estimates of demographic parameters are important for making inferences about
evolutionary history, conservation priorities, and phylogeography.
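
The two relationships above can be rearranged to solve for the time since divergence and for the effective population size. The Python sketch below does this for made-up inputs: the divergence, the generation time, and the estimate of theta are assumptions for illustration, divergence is expressed as a proportion (0.01 = 1%), and only the substitution rate of 1.9 x 10^-9 s/s/y is taken from the text.

```python
def divergence_time_years(d, mu_per_year):
    """Solve d = 2 * mu * t for t, the number of years since divergence."""
    return d / (2.0 * mu_per_year)


def effective_population_size(theta, mu_per_generation):
    """Solve theta = 4 * Ne * mu for Ne (diploid, autosomal loci)."""
    return theta / (4.0 * mu_per_generation)


mu_year = 1.9e-9        # avian-wide rate quoted in the text (s/s/y)
d = 0.01                # assumed: 1% sequence divergence, as a proportion
generation_time = 2.0   # assumed: years per generation
theta = 0.004           # assumed: per-site estimate of theta

t = divergence_time_years(d, mu_year)
n_e = effective_population_size(theta, mu_year * generation_time)
print(f"t = {t:,.0f} years; Ne = {n_e:,.0f}")
```

Note the unit conversion in the second call: the per-year rate is multiplied by the generation time because $\theta = 4N_e\mu$ requires a rate per generation.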


4 Genetic Drift and Effective Population Sizes

Whereas mutation adds genetic variation to a population, genetic drift removes
it. Genetic drift is the stochastic fluctuation in allele frequencies over time that
results from the random survival of individuals and the random sampling of gametes
during reproduction. In an idealized population of size $N_c$, the probability that two
copies of a gene randomly sampled from a population are identical by descent (i.e.,
they were derived from the same ancestor in the previous generation) is $1/(2N_c)$.
Lineages that fail to leave descendants go extinct, and any unique mutations within
those lineages are lost. Because the rate at which genetic variation is lost is inversely
correlated with population size, smaller populations lose variation more rapidly than
larger populations. However, this relationship assumes a constant population size
(i.e., population sizes remain the same between generations), generations that do not
overlap, 1:1 sex ratios, equal variance in reproductive success between the sexes, and
random mating. In reality, populations deviate from these assumptions, which
usually results in a faster rate of genetic drift than expected given $N_c$. $N_e$ is the
size of an ideal population that loses genetic variation at a rate equal to that of the
actual population (Wright 1931). In other words, $N_e$ quantifies the rate at which
genetic drift decreases genetic diversity within a population. Across a wide range of
studies, Frankham (1995) estimated that $N_e$ averaged about $0.1 N_c$.
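
The consequence of the $1/(2N)$ probability above is the standard expectation that heterozygosity declines by a factor of $1 - 1/(2N)$ per generation in an ideal population. The sketch below applies that expectation to a hypothetical census size and to an effective size one tenth as large; the starting heterozygosity, census size, and number of generations are assumptions chosen for illustration.

```python
def expected_heterozygosity(h0, n, generations):
    """Expected heterozygosity after drift in an ideal diploid population of size n."""
    return h0 * (1.0 - 1.0 / (2.0 * n)) ** generations


n_c = 1000          # hypothetical census size
n_e = 0.1 * n_c     # Frankham's (1995) average ratio, used only as a rough guide

for label, n in (("census size", n_c), ("effective size", n_e)):
    h = expected_heterozygosity(0.5, n, generations=100)
    print(f"{label} N = {n:g}: H after 100 generations = {h:.3f}")
```

Running the comparison shows why $N_e$, not $N_c$, is the relevant quantity: substituting the census size would substantially understate how quickly variation is expected to be lost.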

Applications of genomics to inferences of $N_e$ and the role of genetic drift have
primarily focused on fluctuations in population sizes over evolutionary time, with a
particular emphasis on the role of past climate changes. Calderon et al. (2016) used
approximate Bayesian computation (ABC) to fit ddRAD-seq data from European
turtle doves to five models of demographic history, including constant population
sizes and various scenarios of fluctuating population sizes. They found that their data
best fit a model that included a population expansion during the late Pleistocene
(~78,000 years before present; ybp) followed by a population decline during the
Holocene (~7600 ybp). Reductions in $N_e$ have also been inferred from ddRAD-seq
data for various species of dry forest birds from South America (Oswald et al. 2017).
Interestingly, they found similar changes in $N_e$ between ancestral and daughter
populations among the six species studied, despite considerable variation in population
divergence times. They attributed these long-term reductions in $N_e$ to historical
reductions in the geographic extent of dry forests in this region. One of the main
strengths of these inferences lies within the hypothesis-driven framework that is
often used in population genomics and phylogeography (Carstens et al. 2017; see
Sect. 8). In particular, fitting the data to various models of population size changes
and using a Bayesian or likelihood approach to choose the best-fit model make it
possible to reject simpler models in favor of more complex models.

Whole-genome data from a single diploid individual can also provide information
about past population size changes. In a comparative study of 38 bird species,
Nadachowska-Brzyska et al. (2015) used the pairwise sequentially Markovian
coalescent (PSMC, Li and Durbin 2009) model to show that demographic histories
varied considerably among species and that the $N_e$ of some species fluctuated by
orders of magnitude. One prominent pattern was a major reduction in population
sizes associated with the last glacial period (LGP; ~110-12 kya). Surprisingly,
however, Nadachowska-Brzyska et al. (2015) did not find a relationship between
the extent of the decline and whether current ranges overlapped with regions
severely influenced by glaciation (e.g., were formerly covered in ice or extreme
deserts). Similar patterns of demographic fluctuations and major reductions in $N_e$
associated with the last glacial period have been inferred from whole-genome
sequences and the PSMC for grouse (Lagopus spp.) (Kozma et al. 2018),
black-and-white flycatchers (Ficedula spp.) (Nadachowska-Brzyska et al. 2016), and geese
(genera Anser and Branta) (Ottenburghs et al. 2017b).


5 Gene Flow 

When a population gets subdivided, random genetic drift and selection can lead to
genetic divergence among the subpopulations. Migration, the movement of
organisms among these subpopulations, can act as a kind of genetic glue that
binds the subpopulations genetically and sets a limit to the amount of genetic
divergence that can accumulate (Hartl and Clark 2007). In the literature, migration
and gene flow are often used interchangeably. However, there is an important
difference between the two terms: migration refers to the movement of individuals
between subpopulations, while gene flow encompasses the movement of alleles and
their establishment into a different gene pool (Tigano and Friesen 2016). Hence,
migration does not necessarily result in gene flow (Verhulst and Van Eck 1996).

Direct estimates of migration often involve mark-recapture methods, which can
be impractical and labor-intensive for large populations with low migration rates.
Therefore, indirect measures based on genetic data are mostly preferred. Early
studies estimating gene flow (expressed as $N_e m$) from genetic data relied on $F_{ST}$
or other measures of differentiation (Slatkin and Barton 1989). However, the
population genetic models for these estimations assume unrealistic conditions,
such as constant population size, symmetrical migration, and mutation-drift equilibrium
(Whitlock and McCauley 1999; Wilson and Rannala 2003; Marko and Hart
2011). The development of non-equilibrium approaches provided the opportunity to
assess more realistic scenarios of gene flow. Specifically, isolation-with-migration
models enabled the joint estimation of gene flow, genetic diversity, and divergence
times within a maximum likelihood framework (Hey and Nielsen 2004; Hey 2006;
Hey et al. 2018). For example, isolation-with-migration analyses based on a
multilocus dataset indicated asymmetrical gene flow from indigo bunting (Passerina
cyanea) to lazuli bunting (P. amoena) (Carling et al. 2010). Alternative software
packages, such as migrate-n (Beerli and Palczewski 2010), have also been used to
quantify the degree of gene flow in migrating waterfowl populations of mallard
(Anas platyrhynchos) (Kraus et al. 2013) and barnacle goose (Branta leucopsis)
(Jonker et al. 2013).
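
For orientation, the classical $F_{ST}$-based estimate mentioned at the start of this paragraph can be sketched as follows. The sketch assumes Wright's island model at equilibrium, in which $F_{ST} \approx 1/(4N_e m + 1)$; this is exactly the kind of idealized setting that the text describes as unrealistic, so the numbers are illustrative rather than a recommended estimator. The function name and the example $F_{ST}$ values are hypothetical.

```python
def nem_from_fst(fst):
    """Effective migrants per generation under Wright's island model.

    Rearranges the equilibrium approximation F_ST ~ 1 / (4 * Ne * m + 1).
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("F_ST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


for fst in (0.01, 0.05, 0.2, 0.5):
    print(f"F_ST = {fst:.2f} -> Ne*m = {nem_from_fst(fst):.2f}")
```

Under these assumptions, an $F_{ST}$ of 0.2 corresponds to one effective migrant per generation, and the estimate rises steeply as $F_{ST}$ approaches zero, which is one reason small errors in weakly differentiated systems translate into large uncertainty in $N_e m$.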

Similar to the inference of genetic drift and effective population sizes, the
development of ABC models allowed population geneticists to probe more complex
models and evaluate the extent of gene flow by comparing simulated DNA sequence
evolution with empirical data (Beaumont 2010). For instance, a recent study compared
15 models (with different patterns and levels of gene flow) to assess the
demographic history of pied flycatcher (Ficedula hypoleuca) and collared flycatcher
(F. albicollis). ABC modelling based on whole-genome re-sequencing data from
20 individuals supported a recent divergence with unidirectional gene flow from
pied to collared flycatcher after the Last Glacial Maximum (Nadachowska-Brzyska
et al. 2013). Similar analyses have been performed to assess the demographic history
of other bird species, such as Melospiza sparrows (Smyth et al. 2015), Myrmeciza
antbirds (Raposo do Amaral et al. 2013), and Platalea spoonbills (Yeung et al.
2011). These studies indicate that model-based approaches are a fruitful avenue for
the reliable estimation of gene flow (Ottenburghs et al. 2017a). Machine
learning techniques have recently been applied to population genomic questions
(Schrider and Kern 2018), but this approach has not yet reached the ornithological
community.

The development of more sophisticated tools in combination with the availability 
of genomic data led to important insights into the role of gene flow in population 
dynamics (Ottenburghs et al. 2017a). Similar to mutation, gene flow can introduce 
novel alleles into a population. This can be shown even between species, for example
by modelling the probability of allele sharing between related duck species with
or without assuming hybridization (Kraus et al. 2012). The main difference with
mutation is the speed at which this happens: the rate of migration is vastly greater 
than the rate of mutation (Hedrick 2013). The fate of these novel alleles depends on 
the specific genetic and environmental context in which they end up (Payseur 2010). 
In general, alleles can be divided into three categories: (1) neutrally evolving alleles 
that flow freely between populations, (2) alleles that confer an adaptive advantage 
and flow quickly, and (3) alleles that are not adapted to local conditions and are 
consequently selected against. 

These allele-specific patterns of gene flow result in a heterogeneous genomic 
landscape in which some genomic regions are more prone to be exchanged between 
populations than others (Nosil et al. 2009; Ravinet et al. 2017; Wolf and Ellegren 
2017). For example, a study comparing the genomes of hooded crow (Corvus corone
cornix) and carrion crow (C. c. corone), two subspecies that interbreed along a
narrow hybrid zone across Europe, uncovered a peculiar genomic landscape in 
which gene flow was relatively unrestricted across the genome except for one 
genomic region. This region harbored several genes involved in pigmentation and 
visual perception, suggesting a role in reproductive isolation (Poelstra et al. 2014). 

In recently diverged populations, reproductive isolation can be caused by assor- 
tative mating, in which individuals with similar phenotypes mate with one another 
more frequently than would be expected under a random mating pattern (Ritchie 
2007; Uy et al. 2018). For instance, the alba and personata subspecies of the white 
wagtail ( Motacilla alba) mate assortatively based on head plumage patterns. This 
nonrandom mating results in a reduction in gene flow—estimated using almost 
20,000 SNPs—between these subspecies (Semenov et al. 2017). The traits underly¬ 
ing assortative mating are various (e.g., song, plumage, behavior) and can originate 
in different ways (Uy et al. 2018). Sexual selection can drive changes in mating 
preferences and associated display traits (Ritchie 2007; Kopp et al. 2018). 
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Alternatively, natural selection can cause divergence in traits not related to mate 
choice, which may later be co-opted as mating signals, so-called magic traits 
(Servedio et al. 2011). In the end, natural and sexual selection can act in concert, 
culminating in a barrier to gene flow (Servedio and Boughman 2017). This synergy 
between natural and sexual selection is nicely illustrated by bird species in which 
different subpopulations are adapted to different food sources. Divergent natural 
selection can then result in distinct beak morphologies, which consequently produce 
different acoustic signals, such as songs or call types. Assortative mating based on 
song or call type can lead to a reduction in gene flow between subpopulations. This 
scenario has been described for Loxia crossbills (Parchman et al. 2006), Melospiza 
sparrows (Ballentine et al. 2013), and Aphelocoma scrub jays (Langin et al. 2015). 
So far, genetic data have allowed population geneticists to document these patterns,
and genomics will lead to a more fine-grained picture of gene flow dynamics and
provide the opportunity to pinpoint the genetic basis of the traits underlying
assortative mating.

In addition to assortative mating, barriers to gene flow can also be physical. 
Numerous studies have documented how mountain ranges (Manthey et al. 2016; 
Moyle et al. 2017; Machado et al. 2018; Padro et al. 2018), rivers (Maldonado- 
Coelho et al. 2013; Fernandes et al. 2014), ecological transitions (Caro et al. 2013; 
Zhen et al. 2017; Garg et al. 2018), and sea currents (Munro and Burg 2017) can 
limit dispersal and act as barriers to gene flow. However, when assessing how 
geographical and topographical barriers influence patterns of gene flow, it is important
to keep the ecology and dispersal capacity of the species under investigation in mind. 
A study on Pleistocene land bridges in Sulawesi emphasizes this point: using 
ddRAD-seq data, the authors estimated the amount of ancient gene flow between 
the island populations of two bird species, the henna-tailed jungle flycatcher 
(Cyornis colonus) and the golden whistler (Pachycephala pectoralis). During the 
Pleistocene, the islands Peleng and Taliabu were connected by land bridges allowing 
animals to disperse from one island to the other. The analyses revealed little evidence 
of genetic exchange between the jungle flycatcher populations on Peleng and 
Taliabu, whereas there had been gene flow between island populations of golden 
whistler. The differences in gene flow dynamics probably depended on the ecology 
of the species: the jungle flycatcher is a specialized bird with poor dispersal 
capacities and does not venture outside forests often. The golden whistler, however, 
is a generalist that tends to explore new territories (Garg et al. 2018). Similarly, 
research on the role of Amazonian rivers as barriers to gene flow has produced
contrasting results: some studies report clearly separated populations on each side of
the river (Maldonado-Coelho et al. 2013; Fernandes et al. 2014), while other studies 
documented gene flow at headwaters (Weir et al. 2015; Sandoval-H et al. 2017). In 
summary, what might be a barrier for one bird species is not necessarily a barrier for 
another one. 

6 Selection

Species are continuously adapting to ever-changing environments (Dobzhansky 
1940; Bush 1975; Orr and Smith 1998). Genetic differences that arise through 
mutation or enter a population by gene flow result in populations of individuals 
with subtle morphological, ecological, or other differences (Coyne and Orr 2004). It 
is this diversity that selection works with, and thus these differences among 
individuals often dictate the “adaptability” of a species or population (Barton and 
Hewitt 1989; Orr 2001). Specifically, selection favors morphological, ecological, or 
other traits that increase survival and fecundity of an individual in a particular niche 
space (Fisher 1930; Price 1998; Rundle and Nosil 2005; Via 2009; Sobel et al.
2010; Wolf et al. 2010). Charles Darwin proposed in 1859 that
the composition of a population or species changes (or evolves) due to the differential
survival of individuals in varying environments and termed the responsible force
“natural selection” (Darwin 1859). Thus, evolution proceeds through the selection
of traits that provide a competitive advantage, consequently increasing mating
success. Since Darwin established natural selection as a dominant
force in the evolutionary process, the types of selection have been further refined.
For example, the elaborate feathers and mating displays of birds are
classical examples of sexual selection (Lande 1980; Andersson 1994; Johnsgard 
1994; Promislow et al. 1994; Grant and Grant 1997; Price 1998; Clutton-Brock 
2007; Krakauer 2008). In such a case, sexual selection confers higher mating success
for the displaying sex despite any negative impact the trait may have via natural
selection (e.g., increased predation risk).

Given that the number and survival of new mutations is largely dictated by 
population size (i.e., more mutations enter and are maintained in larger populations), 
selection is most effective in large populations (Ohta 1972, 1992; Gillespie 2001; 
Ellegren 2009), whereas genetic drift will dominate in smaller populations (Sect. 4). 
Thus, the probability that beneficial mutations are lost due to genetic drift increases
as population size decreases. Because new mutations start at a naturally low
frequency within any population, the majority of them are lost due to genetic drift
before selection can raise their frequency, unless the selection favoring them is
strong (i.e., unless the selection coefficient s is relatively large). In short, higher individual
diversity increases the probability that a species survives challenges, such as changes 
affecting their current environment or when invading novel niche space (Turelli et al. 
2001; Wu 2001; Coyne and Orr 2004). For example, a species that is comprised of 
largely clonal individuals (i.e., low genetic diversity) has little chance to survive new 
ecological or other challenges because none of the individuals have variants that 
would confer an adaptive response. Such scenarios are often an important issue for 
endangered or highly specialized taxa with small population sizes or ranges (e.g., 
islands) (Dickerson 1973; Templeton 1986; Lacy 1987; Hughes et al. 1997; Oyler- 
McCance et al. 1999; Mock et al. 2004). Captive breeding programs often need to 
contend with this issue (Elsbeth McPhee 2004; Fraser 2008; Cassin-Sackett et al. 
2019). Conversely, a species comprised of a diversity of individuals is more likely to 
have a proportion of individuals that may have the (genetic) variation necessary to
survive the same challenges. 
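
The interplay between drift and selection described above can be made concrete with a small calculation. The sketch below is not taken from the studies cited here. It assumes Kimura's diffusion approximation for the fixation probability of a single new mutation with additive selective advantage s in a diploid population whose effective size equals its census size N. The function name and the parameter values are illustrative only.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Approximate fixation probability of one new mutant copy.

    Assumptions (not from the source text): diploid population of n
    individuals with effective size equal to n, initial frequency 1/(2n),
    additive selection with coefficient s >= 0 (Kimura's diffusion
    approximation).
    """
    if s == 0.0:
        return 1.0 / (2 * n)  # neutral case: fate decided by drift alone
    # (1 - exp(-2s)) / (1 - exp(-4ns)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)


for n in (10, 1_000, 100_000):
    p_sel = fixation_probability(0.01, n)
    p_neutral = fixation_probability(0.0, n)
    print(f"N={n:>6}  P(fix | s=0.01)={p_sel:.4f}  "
          f"ratio to neutral={p_sel / p_neutral:.1f}")
```

Under these assumptions, a beneficial mutation with s = 0.01 is still lost about 98% of the time in a large population, and in a population of ten individuals its fixation probability is barely higher than that of a neutral mutation.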

The emerging field of population genomics has revealed compelling evidence 
that directional selection, balancing selection, purifying selection, and hitchhiking 
are pervasive throughout the genome, causing widespread departures from neutral 
models (Hahn 2008; McVicker et al. 2009; Charlesworth 2012; Burri 2017a). Thus,
in addition to variance in mutation (Sect. 3) and recombination (Sect. 7) rates, as well 
as differential gene flow (Sect. 5), the variance in selection also contributes to the 
heterogeneous nature of genomes. Importantly, just as with the other evolutionary 
forces, selective processes leave traceable signatures across the genomes of 
populations that researchers are now able to discern between (Wu and Ting 2004; 
Sabeti et al. 2006; Wolf et al. 2010; Schoville et al. 2012; Keller et al. 2013; Wray 
2013; Seehausen et al. 2014; Andrews et al. 2016; Van Belleghem et al. 2018). 
Often, these genes or genetic regions are “needles in a haystack,” and thus,
increasing genomic coverage is essential for their discovery. Whereas coverage was
once limited by standard Sanger sequencing methods, the genomic era now enables
researchers to attain the genomic coverage required when searching for regions under
selection in a genome (Kraus and Wink 2015; Jax et al. 2018b). Thus, by accessing larger
portions of the genome, researchers are able to (1) determine how selection has 
operated in the evolution of their taxonomic system, (2) find important genes 
associated with adaptive traits in their systems, and (3) distinguish between differing 
selective signatures. 

The identification of putative genes or genetic regions under selection is often 
accomplished through “genomic scans” in which markers are compared between 
taxa of interest with various summary statistics, such as relative (e.g., F_ST, Φ_ST) and
absolute (e.g., d_XY) genetic divergence, as well as other measures of genetic
diversity (e.g., pairwise nucleotide diversity π, Tajima’s D). For example, conducting
these genomic scans across ~3500 ddRAD loci, Lavretsky et al. (2015) were able to
identify putative outliers (demarcated as regions of elevated genetic divergence)
on the Z-sex chromosome and several autosomal chromosomes that may be linked to
genes important in the divergence process between mallards and Mexican ducks
(Fig. 2). Additionally, advances in Bayesian and maximum likelihood methods
capable of analyzing large genomic datasets now allow researchers to assign
statistical significance to each outlier (Perez-Figueroa et al. 2010; Feng et al. 2015). In
short, these programs often test each marker by comparing it to the overall genomic
background to determine its statistical significance. For example, genomic scans
and statistical tests revealed that the evolution of high-elevation adaptation for 
several Andean birds was due to the simple effect of positive selection on amino 
acid changes in hemoglobin for higher oxygen affinity (McCracken et al. 2009; 
Natarajan et al. 2015). 
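
The logic of a genomic scan, comparing each marker against the genomic background, can be sketched in a few lines. The example below is a deliberately naive illustration and not the method implemented in BayeScan or in the studies cited above: it assumes biallelic markers, computes a simple heterozygosity-based F_ST from two population allele frequencies, and flags loci whose F_ST lies more than a chosen number of standard deviations above the genome-wide mean. All function names, the cutoff, and the allele frequencies are hypothetical.

```python
from statistics import mean, stdev


def fst(p1: float, p2: float) -> float:
    """Per-locus F_ST for a biallelic marker from two allele frequencies.

    Uses F_ST = (H_T - H_S) / H_T with equal population weights, where H_S
    is the mean within-population and H_T the pooled expected heterozygosity.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:  # locus fixed for the same allele in both populations
        return 0.0
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_t - h_s) / h_t


def outlier_loci(freqs, z_cutoff=3.0):
    """Return indices of loci whose F_ST exceeds the mean by z_cutoff SDs."""
    values = [fst(p1, p2) for p1, p2 in freqs]
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if (v - mu) / sd > z_cutoff]


# Hypothetical data: 19 weakly differentiated loci and one strongly
# differentiated locus at index 19.
loci = [(0.5, 0.6)] * 19 + [(0.1, 0.9)]
print(outlier_loci(loci))
```

Dedicated outlier methods additionally model demography and sampling variance, which this simple cutoff ignores.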

Fig. 2 Distribution of Φ_ST values for chromosomes containing significant outliers
(chromosomes Z, 1, 2, 3, 4, and 14) for pairwise comparisons between mallards (MALL), American
black ducks (ABDU), Mexican ducks (MEDU), Western Gulf Coast mottled ducks (MODUWGC),
and Florida mottled ducks (MODUFL). Black dots denote markers identified to be putatively under
diversifying selection in each pairwise comparison when analyzed in BayeScan v. 2.1 (Foll and
Gaggiotti 2008). Such comparative analyses provide the opportunity to identify the species in which
divergent selection may be occurring. For example, the outlier region within an ~11 Mbp region
(1.0 × 10^8 to 1.2 × 10^8 bp) on chromosome 1 was found when comparing mallards to each of the
monochromatic taxa, suggesting divergent selection occurring in mallards. Similarly, an outlier
locus on chromosome 14 (position ~1.6 × 10^7; also see Lavretsky et al. 2015) was detected in all
four comparisons involving Mexican ducks, suggesting directional selection at this or a linked locus
in Mexican ducks only. The figure was adapted from Lavretsky et al. (2019)

Sex-linked markers have been particularly interesting, as these have often been
found to have significantly higher divergence patterns as compared to autosomal
and/or mitochondrial markers. These patterns are especially detectable when
speciation is at the earliest stage (Haldane 1948; Frank 1991; Reeve and Pfennig 2003;
Phadnis and Orr 2009) and have been documented in birds (Minvielle et al. 2000;
Saether et al. 2007; Pryke 2010), insects (Phadnis and Orr 2009; Martin et al. 2013),
and mammals (Tucker et al. 1992; Sutter et al. 2013). For example, important 
reproductive isolation mechanisms, such as male sterility, sexually selected male 
plumage traits, and assortative mating, have all been linked to sex chromosomes 
(Minvielle et al. 2000; Saether et al. 2007; Turelli and Moyle 2007; Carling and 
Brumfield 2009; Phadnis and Orr 2009; Pryke 2010; Abbott et al. 2013; Pease and 
Hahn 2013; Stolting et al. 2013). In general, recent work suggests that due to the 
effect of recombination on the possible breakup of coadapted genes and admixture of 
alleles between diverging populations via gene flow, selection is more likely to lead 
to the adaptive divergence of traits linked to markers found in regions of low 
recombination because these regions are shielded from maladaptive gene flow 
from other populations (Delmore et al. 2015; Samuk et al. 2017). Thus, the
probability of recovering markers linked to evolutionarily important regions on sex
chromosomes is likely the product of their smaller absolute and effective size, as
well as higher linkage disequilibrium as compared to autosomes (Bergero and
Charlesworth 2009). For example, conducting genomic scans using ddRAD-seq
data between mallards and Mexican ducks, Lavretsky et al. (2015) found 2-3% of
Z-linked loci, compared to <0.1% of autosomal loci, to be outlier loci under divergent
selection. Indeed, elevated Z-differentiation deviated from neutral expectations
when simulating data that incorporated demographic history and differences in
effective population sizes between marker types. In contrast, Z-linked and autosomal
differentiation (Φ_ST = 0.017 and 0.013, respectively) were similar among the seven
Mexican duck sampling locations, following a scenario of genetic drift and isolation
by distance. Similar to Mexican ducks and mallards, Chaves et al. (2016) found that
key adaptive traits (e.g., beak size) in Darwin’s finches were also associated with a
few genes (11 of 32,569 SNPs) but found these putatively evolutionarily important
genes across multiple chromosomes. Similarly, other studies also report genetic
regions involved in adaptive divergence and reproductive isolation to be scattered 
throughout the genome (Parchman et al. 2013). 


7 Recombination 

Similar to other genomic parameters, such as gene density and mutation rate, 
recombination rate is highly variable along a genome. Regardless of chromosome 
size, at least one crossover per chromosome (or chromosome arm) is required for 
proper segregation of homologous chromosomes during meiosis (Fledel-Alon et al. 
2009; Wang et al. 2012). This obligatory crossover results in a negative correlation
between recombination rate and chromosome size because the rate is calculated as the
total genetic distance (in centimorgans, cM) divided by the physical size of a
chromosome (in Mb). Because of the large differences in chromosome size in bird genomes
(Damas et al. 2019), recombination rate differs by an order of magnitude between
the largest and smallest chromosomes in birds (Groenen et al. 2009; Backstrom et al.
2010; Kawakami et al. 2014; van Oers et al. 2014). In addition, recombination rate is
also variable within a chromosome, where the rate is lower near centromeres and
increases away from them (Choo 1998; Talbert and Henikoff 2010). At a finer scale,
birds and several other species have small genomic regions, referred to as
recombination hotspots, where the rate is often hundreds or even thousands of times higher than
the surrounding regions (reviewed in Stapley et al. 2017). Genomic locations of
recombination hotspots appear to be conserved over tens of millions of years during 
bird evolution (Singhal et al. 2015; Kawakami et al. 2017). Furthermore, the
pseudoautosomal region (PAR), the only recombining region on sex chromosomes in the
heterogametic sex (i.e., female birds with Z and W sex chromosomes), shows an 
extremely high recombination rate (>700 cM/Mb) (Smeds et al. 2014). Therefore, a 
highly heterogeneous recombination landscape is a hallmark of avian genomes, and 
characterizing detailed variation of recombination rate is a necessary step toward the 
understanding of how genetic variation changes over time in a genome. 
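
The negative relationship between chromosome size and recombination rate that follows from the obligatory crossover can be illustrated with simple arithmetic. The sketch below assumes that exactly one crossover per chromosome per meiosis corresponds to a genetic map length of 50 cM. The two chromosome sizes are hypothetical round numbers chosen to resemble a macrochromosome and a microchromosome, not measurements from any particular species.

```python
OBLIGATE_MAP_LENGTH_CM = 50.0  # one crossover per meiosis corresponds to 50 cM


def minimum_recombination_rate(size_mb: float) -> float:
    """Lower bound on the chromosome-wide rate (cM/Mb) set by one crossover."""
    if size_mb <= 0:
        raise ValueError("chromosome size must be positive")
    return OBLIGATE_MAP_LENGTH_CM / size_mb


# Hypothetical chromosome sizes in Mb (illustrative values only).
for name, size_mb in (("macrochromosome", 200.0), ("microchromosome", 5.0)):
    rate = minimum_recombination_rate(size_mb)
    print(f"{name}: {size_mb:.0f} Mb -> at least {rate:.2f} cM/Mb")
```

With these illustrative sizes, the same single crossover yields a rate forty times higher on the small chromosome than on the large one.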

There are at least two ways for recombination to affect genetic variation in a given 
genomic region, namely, “linked selection” and GC-biased gene conversion 
(gBGC). First, as discussed in Sect. 6, positive selection removes genetic variation at a
locus under selection by fixation of an advantageous allele, while negative selection 
(purifying or background) reduces genetic variation because new mutations in 
functionally important regions, such as protein-coding genes and regulatory 
elements, cannot increase in frequency if they have a negative effect on fitness 
(i.e., deleterious mutations). Removal of variants is not restricted to target loci under
selection (positive and negative); variants at neighboring loci can also be removed 
from a population if those neighboring loci are physically linked to the target loci 
(hence referred to as “linked selection”) (Cruickshank and Hahn 2014; Burri 2017b). 
Since the extent of linkage between loci under selection and neighboring loci 
depends on local recombination rate, there is a significant negative correlation 
between genetic diversity and recombination rate (Burri et al. 2015; Vijay et al. 
2017). Because recombination rate variation is likely conserved between species 
(Singhal et al. 2015; Kawakami et al. 2017), patterns of genetic diversity along a 
genome are also likely similar between species (Burri et al. 2015; Dutoit et al. 2017; 
Vijay et al. 2017). Evaluation of baseline genetic diversity is particularly important 
in genomic scan analyses because measurement of relative genetic divergence 
between species is a function of genetic diversity within species and, consequently, 
low recombination regions tend to stand out as highly differentiated outlier regions 
even without direct involvement in the process of speciation. 
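
Why low-diversity regions stand out in scans of relative divergence can be shown with a short calculation. The sketch below assumes one common formulation in which relative divergence is the proportion of between-species divergence (d_XY) not accounted for by mean within-species diversity (π). The function name and the window values are hypothetical.

```python
def relative_divergence(d_xy: float, pi_within: float) -> float:
    """F_ST-like statistic: share of d_XY not explained by within-species pi."""
    if d_xy <= 0:
        raise ValueError("d_xy must be positive")
    return 1.0 - pi_within / d_xy


# Two hypothetical genomic windows with identical absolute divergence.
windows = {
    "high recombination": {"d_xy": 0.010, "pi_within": 0.008},
    "low recombination": {"d_xy": 0.010, "pi_within": 0.002},
}
for label, w in windows.items():
    value = relative_divergence(w["d_xy"], w["pi_within"])
    print(f"{label}: relative divergence = {value:.2f}")
```

Both windows are equally diverged in absolute terms, yet the window in which linked selection has reduced within-species diversity shows a fourfold higher relative divergence.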

Fig. 3 DNA double-strand breaks (DSBs) are repaired as either crossovers (COs) or
noncrossovers (NCOs). GC-biased gene conversion (gBGC) results from biased
incorporation of GC over AT nucleotides in regions close to DSBs with base mismatches
between paternal and maternal chromosomes (red and blue)

Second, gBGC is a neutral, recombination-associated process that can leave a
similar genetic footprint as positive selection by distorting the allele frequency
distribution. Recombination is initiated by the formation of DNA double-strand
breaks (DSBs), which are subsequently repaired as crossovers or noncrossovers.
When crossovers occur, there is reciprocal exchange of DNA between homologous
chromosomes (Fig. 3). During these repair processes, G or C nucleotides are
preferentially transmitted over A or T nucleotides in regions close to DSBs with
G:C and A:T base mismatches between paternal and maternal chromosomes (Duret
and Galtier 2009; Mugal et al. 2015). Since gBGC takes place more frequently in
regions experiencing frequent DSBs and recombination, highly recombining regions
are more strongly affected by gBGC with stronger transmission bias toward G:C
nucleotides. While positive selection increases allele frequency of “better-fit” alleles
by virtue of their selective advantages, gBGC spreads G:C alleles independent of 
their effect on fitness. This causes a serious challenge in detecting a signature of 
selection because the strong effect of gBGC in high recombination regions can drive 
the fixation of potentially deleterious G or C alleles and, hence, counteract natural 
selection. In addition, the skewed allele frequency distribution by gBGC relative to 
neutral expectation can also affect inferences of demographic history and natural 
selection based on various population genetic statistics (Bolivar et al. 2018; Pouyet 
et al. 2018). Altogether, we must estimate the baseline genetic diversity by taking 
into account the effect of linked selection and gBGC in order to infer demographic 
history and detect signatures of selection (Mugal et al. 2015). Forward simulation 
approaches that take into account the variation of recombination rate, gene density, 
background selection, and demographic events can provide an analytical framework to
simulate genome-wide patterns of genetic diversity and divergence, with which
empirical data can be compared in order to detect outlier regions (Comeron 2017). In
addition, machine learning approaches can jointly estimate effective population sizes
and the impact of linked selection (both background selection and selective sweeps)
on the pattern of genetic diversity (Schrider and Kern 2016, 2018; Schrider et al. 
2016; Sheehan and Song 2016). 


8 Phylogeography: The Interface Between Population 
Genetics and Phylogenetics 

The early study of mitochondrial DNA lineages when PCR and DNA sequencing 
became available (Wink 2019) revealed that branches of intraspecific gene trees 
often followed striking geographic patterns (Avise et al. 1987). The study of the 
relationship between gene genealogies and geography became known as 
phylogeography (Avise 2000). Some early examples of phylogeographic studies 
on avian mtDNA include snow goose (Anser caerulescens) (Avise et al. 1992; Quinn
1992), northern flicker (Colaptes auratus) (Moore et al. 1991), and common grackle
(Quiscalus quiscula) (Zink et al. 1991). Phylogeography provides a bridge between
phylogenetics (i.e., the reconstruction of evolutionary relationships) and population 
genetics, describing how genetic variation—introduced by mutation (see Sect. 3) — 
is geographically structured within and between populations by population genetic 
processes, such as genetic drift (see Sect. 4), gene flow (see Sect. 5), selection (see 
Sect. 6), and recombination (Sect. 7). For populations that have been separated 
historically and have experienced little or no gene flow, genetic differences can 
accumulate by these evolutionary processes, potentially resulting in speciation 
(Ottenburghs 2019). 

Phylogeography relied heavily on non-recombining and rapidly evolving mtDNA 
to match gene genealogies with geography (Avise 2000). The advent of genomic 
data in combination with the development of coalescent theory (Kingman 1982a, b) 
has revolutionized the field (Edwards et al. 2015). In general, the application of next- 
generation sequencing technologies uncovers more detailed population structure that 
is often missed by traditional markers, such as mtDNA and microsatellites. For
example, using RADseq data, Ruegg et al. (2014) were able to more reliably
distinguish between eastern and western populations of the Wilson’s warbler
(Cardellina pusilla) compared to previous studies based on mtDNA (Kimura et al.
2002; Paxton et al. 2013) and AFLPs (Irwin et al. 2011). In addition, the application
of multilocus datasets revealed that different genes often result in different gene trees 
(Degnan and Rosenberg 2009). This phylogenetic incongruence can provide a more 
detailed picture of population history because different gene trees capture particular 
historical events and population genetic processes that have shaped the present 
patterns of genetic diversity. However, recent work has also uncovered high levels 
of reticulation due to recombination (see Sect. 7) and gene flow (Edwards et al. 
2016). New statistical methods are being developed to deal with such reticulated 
scenarios (Dai et al. 2010; Ottenburghs et al. 2017a; Zhu et al. 2018). 


9 Conclusions and Outlook 

In this chapter we mostly dealt with questions relating to which inferences we can 
make from genetic variation data on a population scale, with respect to what we 
know about geological events as well as current and past geography. Traditionally, 
the study of phylogeography has a strong focus on demographic processes and 
distribution of genetic variation in time and space. The introduction of genomic 
techniques dramatically increases the statistical power with which we can answer 
questions and describe systems. In sections about the source, maintenance, and loss 
of genetic variation, we introduced the concepts of natural and sexual selection. This, 
in contrast to neutral variation that is shaped by demography, is the second and 
perhaps more innovative major addition that population genomics brings us
compared to population genetics.

Measuring genetic variation everywhere in the genome, including both neutrally 
and adaptively evolving regions, allows us to understand demography and
adaptation concurrently. Studies of functional variation have so far only been possible
on the interspecies level. Many studies in the past have analyzed the evolutionary
history of genes known to be involved in key adaptations of a certain lineage. For 
instance, innate immunity in birds is well studied on the avian lineage scale. Cheng 
et al. (2015) deciphered evolutionary signals in effector molecules of the immune 
defense such as defensins and cathelicidins, and Velova et al. (2018) studied 
membrane proteins that control the identification and recognition of pathogens, the 
toll-like receptors. However, from our point of view, the really interesting studies are 
those taking into account the population approach and using population genetic 
theory to measure selection pressure. The first toll-like receptor population 
re-sequencing paper on wader species revealed purifying selection and domain- 
specific evolution (Raven et al. 2017). This was not a genome-wide study and lacked 
comparative information from a number of related genes, so the final conclusions 
remain tentative, but such studies are important next steps in understanding the 
relationship between functional and adaptive variation and offer a glimpse of what 
may become possible in the future. A similar study on bird species with rather
different phylogeographic histories curiously rejected the impact of natural selection
(here, supposedly pathogen pressure) on the molecular evolution of this receptor
family. Instead, the authors found that drift in small populations overrides the effects 
of natural selection (Gonzalez-Quevedo et al. 2015) as is expected on theoretical 
grounds under such conditions (Lynch 2007). Chapman et al. (2016) studied several 
toll-like receptor genes both on the lineage and population scale to place their 
evidence for diversifying selection into the broader evolutionary framework. Yet, 
the individual gene was studied in isolation from the rest of the genome. Whole- 
genome information on a population scale will make possible the study of selection 
patterns and their interaction with phylogeography, as the costs of sequencing 
continue to decline (Kraus and Wink 2015). New approaches to study functional 
variation not only within gene families (gene-centric) but also within and across 
biological pathways (pathway-centric) are now becoming possible. Jax et al. (2018a) 
showcase interactions between the functional variation and molecular evolution of 
more than 100 genes across several immune pathways in mallards and closely 
related ducks around the world and their geographic origin. The future development 
of population genomics might thus culminate in a more functional approach to avian 
evolution. 

Long-term studies on birds have played a pivotal role in addressing important
questions in evolutionary biology, and avian biologists were quick to adopt
new genetic tools. The integration of genetic work on birds has revolutionised the
way we think about avian mating systems, for example. During the last decade,
we have seen a tremendous decline in the cost of sequencing, making it possible
to genotype or sequence hundreds or even thousands of individuals. These tools
open up an exciting new array of questions that can be asked of long-term
longitudinal studies on birds. We review here some of the genetic resources
currently available to researchers studying avian population samples, some
statistical approaches to analyse population-level genomic data and future questions
that long-term studies on birds can provide insights into. It is clear that genomic
approaches on long-term studies on birds have played, and will continue to play, 
an important role for addressing fundamental evolutionary questions. 

1 Introduction

The technical ability to trace inheritance of phenotypic traits at the resolution of 
single nucleotides opens up exciting novel possibilities for studies on ecology and 
evolution. In this chapter, we outline some statistical methods associated with these 
technical developments and discuss major advantages and obstacles faced by 
researchers that apply these methods to wild populations of birds. We start with a 
short description of the historical perspective where we argue that long-term
population studies on birds have played a central role for the development of the whole
field of studying inheritance of phenotypic traits in nature. 

Long-term studies on birds have traditionally played an important role in gaining 
insights into central theories of ecology and evolution. A major underlying reason 
for the focus on birds is the relative ease by which large numbers of individuals can
be marked, measured and blood sampled, and their reproductive performance
followed in great detail under natural conditions. The first long-term
individual-based field studies were typically run by highly skilled and enthusiastic
naturalists, who focused on particular species of birds that were easy to monitor. 
These early scientists include Huibert Kluijver (1902-1977) and David Lack 
(1910-1973) studying great tits (Parus major) in The Netherlands and England, 
respectively, Margaret Nice (1883-1974) studying song sparrows (Melospiza 
melodia) in the USA and Lars von Haartman (1919-1998) studying pied flycatchers 
(Ficedula hypoleuca) in Finland. By collecting information on phenotypic traits
(e.g. morphology and behaviour) measured at the level of individual birds together 
with information about their survival and reproductive performance (i.e. fitness), it 
became possible to actually measure the strength and direction of the main drivers of 
evolution, natural and sexual selection. Another major advantage with many
long-term studies of birds is that extensive pedigree information can be gathered, which is
a crucial asset for quantitative genetic approaches. 

The opportunity to combine measures of natural and sexual selection with 
estimates of genetic parameters means that evolutionary responses to environmental 
change can be studied in real time. One of the best examples of how important this 
type of individual-based field study can be for increasing our general understanding
of evolutionary events taking place on a contemporary time scale comes from Peter
and Rosemary Grant’s long-term study on Darwin’s finches on the Galapagos
Islands. A population of medium ground finches (Geospiza fortis) living on the
island of Daphne Major experienced two serious periods of drought (in 1976-1977
and 1984-1986), which caused changes in food supply resulting in corresponding
changes in natural selection acting on beak morphology. Due to detailed collection 
of data, the Grants were able to predict the phenotypic responses in beak morphology 
observed in the population based on estimates of natural selection and heritability 
(Grant and Grant 1995). This study hence nicely illustrates evolution in action. The 
quantitative genetic approach used by the Grants and their colleagues (e.g. Boag and 
Grant 1978) was based on the classical framework developed by Fisher, who made 
the assumption that the genetic variance present in a population was based on a large 
number of Mendelian factors, each having a small additive effect on the phenotype 
(Fisher 1930). Together with Haldane and Wright, Fisher founded the discipline of 
population genetics that formed an important part of the whole modern evolutionary 
synthesis where several fields of biology, including ecology, genetics, systematics 
and palaeontology, were brought together through a joint acceptance of evolution as 
the central unifying theory of biology. 

The quantitative genetic framework developed by Fisher, Haldane and Wright not 
only played an important role for the development of modern evolutionary theory; it
also formed the basis of the discipline of conservation biology that focuses on 
endangered organisms’ ability to adapt to human activities (e.g. Stockwell et al. 
2003) and laid the foundation for applied animal- and plant-breeding programmes. 
The latter applied fields, in turn, contributed to further development of statistical 
tools like the animal model, which considers covariance between multiple sets of
relatives at the same time and thereby allows more detailed dissections of genetic
and environmental effects (Henderson 1950). In the animal model, phenotypic and
pedigree data are combined to estimate genetic parameters such as additive genetic
variance (V_A) of traits or additive genetic covariances between traits. Not
surprisingly, some of the first studies using the animal model in the context of natural
populations were based on long-term studies on birds, including collared flycatchers, 
great tits, long-tailed tits and house sparrows (Jensen et al. 2003; MacColl and 
Hatchwell 2003; Sheldon et al. 2003; Garant et al. 2004). One example of how the 
animal model can be combined with long-term breeding data to test central
evolutionary theories comes from a study using 24 years of data on collared flycatchers
breeding on the Swedish island Gotland in the Baltic Sea. Qvarnström et al. (2006)
used this data set to test all the key genetic requirements of both Fisherian (Fisher
1930) and ‘good-gene’ models (Zahavi 1975) of sexual selection. They found
significant heritability for all key components: the ornament (male forehead patch
size), female choice based on the ornament, male fitness and female fitness.
However, when the required genetic correlations between these components were taken
into account, the estimated strength of indirect selection on female choice was
negligible (Qvarnström et al. 2006). This means that female mate choice based on
forehead patch size most likely evolves in response to direct selection,
e.g. depending on resources and parental care provided by the males. Selection on
female choice based on male forehead patch size moreover varies across years such
that females paired with large-patched males experienced high relative fitness only
during dry summer conditions (Robinson et al. 2012). Additionally, a decline in the
mean size of this trait across the study period appears connected to the warmer
climate (Evans and Gustafsson 2017), although the physiological explanation for the
connection between weather and relative performance of large-patched males 
remains unknown. Thus, the effects of environmental heterogeneity can be 
quantified and taken into account in the animal model, something that can increase 
accuracy in predicting evolutionary responses and in testing core evolutionary 
theories. 

Most quantitative genetic models (like the animal model) provide insights into the 
genetic variance and covariance of traits but not the underlying loci. To identify the 
QTL (quantitative trait loci) that contribute to genetic and phenotypic variation, we 
need both genotype information and phenotypic data for each individual.


1.1 What Do We Gain By Identifying Genes in Natural Bird 
Populations? 

Knowledge about the structure and function of genes at the molecular level was until 
recently restricted to molecular genetic studies on model organisms in the lab. 
However, the technical ability to trace inheritance of phenotypic traits at the 
single-nucleotide resolution that is now emerging opens up the possibility to answer 
novel sets of evolutionary questions (Ellegren and Sheldon 2008; Barrett and 
Hoekstra 2011; Losos et al. 2013). For example, what is the number and effect 
size of the loci involved in the determination of the trait in focus? Are genetic 
correlations between traits a result of pleiotropic effects or due to linkage? Where are 
these loci located in the genome? What are the local rates of recombination and 
mutations at these sites? How is genetic variation maintained at the level of the loci? 
What selective forces operate at the level of the loci? By answering these types of
questions, a more detailed understanding of the principles and mechanisms of 
evolution and speciation is within reach. 


2 Genomic Resources in Avian Population Research 

Avian biologists were quick to adopt new genetic resources and were already using
multilocus DNA fingerprinting probes to examine mating system variation in the 1980s
(Burke and Bruford 1987). Historically, birds were considered to be among the
animal groups with the highest levels of social and genetic monogamy (>90%; Lack
1968). The use of DNA fingerprinting, and subsequently microsatellites (SSR), has
however clearly shown that birds are far from monogamous; indeed less than 25% of
species are in fact genetically monogamous (Griffith et al. 2002). This finding has 
revolutionised the field of avian behavioural ecology and our understanding of 
sexual selection in birds. 

While microsatellites continue to be the genetic marker of choice for many
applications, the decreasing costs of high-throughput sequencing mean that large
numbers of SNP markers are easy to generate, and they also have lower error rates than
microsatellites (Kraus et al. 2015). This means it is now possible to construct custom
SNP arrays for genotyping hundreds or even thousands of birds (e.g. Kraus et al.
2011, 2013; van Bers et al. 2012; Hagen et al. 2013; Kawakami et al. 2014;
Lundregan et al. 2018), much like that seen in the field of human genetics.

Fig. 1 A selection of some long-term pedigree-based studies on birds [(a) collared flycatcher, (b)
great tit, (c) house sparrow, (d) barn swallow and (e) blue tit] for which genomic resources are
available. Photo credits: (a) Eryn McFarlane, (b and c) Martin Lind, (d and e) Arild Husby

Nevertheless, independent of which markers are used, a key consideration in all 
population studies is the need to genotype or sequence a large number of individuals 
to have reasonable statistical power to study selection or for trait mapping (Kardos 
et al. 2016). Hence, even a relatively low cost for genotyping or sequencing an 
individual will quickly become expensive when hundreds or even thousands of 
individuals are needed. For this reason there are still relatively few model species 
where population scale genomic data are available (Fig. 1, Table 1). Below we 
review recent developments of genetic tools within the field of avian genetics, 
focusing on large-scale individual-based population studies. 


2.1 Whole Genome Sequencing Efforts 

The availability of sequenced bird genomes has increased exponentially over the last
few years, and several avian ecological model species have now been sequenced
(Ellegren 2013; Zhang 2014; Vignal and Eory 2019). The collared and the pied
flycatcher were the first ecological model bird species to have their genomes
sequenced (Ellegren et al. 2012), following sequencing of the chicken (Hillier et al.
2004), zebra finch (Warren et al. 2010), turkey (Dalloul et al. 2010) and domestic
duck (Huang et al. 2013) genomes. The sequencing of the collared and pied
flycatcher marked a significant milestone in avian genomics because it was done
by a small group of scientists rather than a large consortium. The genomes of other
important bird species in evolutionary ecological research have followed swiftly as
sequencing methodology has advanced and prices declined; these include, among
others, great tit (Laine et al. 2016), Darwin’s finches (Lamichhaney et al. 2015), blue
tit (Mueller et al. 2016), barn swallow (Safran et al. 2016) and house sparrow (Elgvin
et al. 2017).

Table 1 Some examples of wild bird systems in which individual phenotypic and genomic data
have been collected

Species | Available genomic resources | Reference
Blue tits (Cyanistes caeruleus) | Genome, transcriptome, 12 K SNP array | Mueller et al. (2016) and Szulkin et al. (2016)
Collared and pied flycatchers (Ficedula albicollis and F. hypoleuca) | Genome, transcriptome, 50 K Illumina SNP array | Ellegren et al. (2012), Uebbing et al. (2013) and Kawakami et al. (2014)
Red junglefowl and domestic chickens (Gallus gallus) | Genome, transcriptome, commercially available SNP arrays | Gering et al. (2015)
Darwin’s finches (Geospiza sp., including all species in radiation) | Genome | Lamichhaney et al. (2015)
Red grouse (Lagopus lagopus scotica) | 384 SNP array | Wenzel et al. (2015)
Superb starling (Lamprotornis superbus) | Transcriptome, 102 SNPs | Weinman et al. (2015)
Great tits (Parus major) | Genome, transcriptome, methylome, 10 K SNP array, 650 K SNP array | van Bers et al. (2012) and Laine et al. (2016)
House sparrows (Passer domesticus) | Genome, transcriptome, 10 K and 200 K SNP arrays | Hagen et al. (2013) and Elgvin et al. (2017)
Ruff (Philomachus pugnax) | Genome, transcriptome | Küpper et al. (2016) and Lamichhaney et al. (2016a)
Attwater’s prairie chicken (Tympanuchus cupido attwateri) | ~20 K SNP array | Bateson et al. (2016)
Zebra finch | Genome, transcriptome, 6000 K SNP array | Knief et al. (2017)

In addition to the species with long-term data listed above, bird genomes are
sequenced on a daily basis, and the goal of the Avian Phylogenomics Consortium
Bird 10 K project (Zhang 2015) is to have a draft genome from all extant bird species
done before 2020 (https://b10k.genomics.cn). As of July 2017 (when the website was
last updated), about 300 species had had their genomes sequenced.

Genome sequencing studies have been instrumental for the technological
development of avian genomics (e.g. Kraus and Wink 2015) and have generated many
important insights into avian genome evolution and provided insights into the 
genetic basis of adaptation (Ellegren 2013, 2014). One illustrative example is the 
use of whole genome sequencing of the ground tit (Parus humilis), an endemic
species in the Paridae family living on the Tibetan plateau, to gain insight into the
genetic basis of high-altitude adaptation. This species is found exclusively above the
tree line on rocky steppes and grasslands and has evolved a number of characteristic
features such as a longer and distinctly downward curved bill, larger body size and
longer legs (used for foraging and digging burrows). Qu et al. (2013) used the Illumina
HiSeq 2000 platform to sequence and assemble a female ground tit as well as a great
tit, a yellow-cheeked tit and a Mongolian ground jay. Using phylogenetic
reconstructions they demonstrated that the ground tit is indeed more closely related to
other Parus species than to the ground jay, to which it is phenotypically more similar.
Interestingly, Qu et al. (2013) also found the ground tit genome showed gene expansion in
energy metabolism and contractions in immune and olfactory perception genes 
compared to the other species, and there were signs of positive selection in genes 
related to hypoxia response and skeletal development. This study demonstrates the 
power of whole genome sequencing to detect the genetic basis of traits involved in 
adaptation to, for example, high altitude in wild bird populations. 

Whole genome sequencing has provided significant insights into avian genome 
evolution, phylogenetics and demography (Ellegren et al. 2012; Nadachowska- 
Brzyska et al. 2013; Burri et al. 2015). However, to study evolution in action
(i.e. microevolution across ecological time scales), large-scale population samples
are needed. For this purpose whole genome sequencing is still too expensive [but see
Kardos et al. (2016) for one example], and here resequencing data can be used to call
SNPs and design custom SNP arrays that can be used for large-scale population
genotyping. 


2.2 Development of SNP Arrays 

SNP arrays can be very useful tools for large-scale population genetic studies and 
trait mapping. SNP discovery can be done using transcriptomic data, sequence data 
or RAD sequence data from a number of individuals such that SNPs can be identified 
as polymorphic in the population and selected for inclusion on the SNP array (Hagen 
et al. 2013; Kawakami et al. 2014; Malenfant et al. 2014). In birds high-density SNP 
arrays were first developed for great tits (van Bers et al. 2012), followed by the house 
sparrow (Hagen et al. 2013) and the collared flycatcher (Kawakami et al. 2014). 
Small-scale custom arrays have also been used in red grouse (Wenzel et al. 2015) 
and zebra finch (Knief et al. 2017). 

Depending on the type of SNP to be included, either one or two probes per locus
are needed: A/T and C/G SNPs require two probes (one probe for each allele, the
so-called Infinium type I probe design in Illumina terminology), whereas all other
SNPs need only one (the Infinium type II design). The cost of a
custom array depends on the number of probes (bead types) included rather than
the number of loci, and thus the Infinium type II probe design is more common.
Following successful SNP identification, a list of SNPs with sequence information
50-100 bp up- and downstream of the SNP is sent to the company (Illumina/
Affymetrix) for quality control using their proprietary assay design tools to make a
final SNP selection to include on the array. Most custom SNP arrays on wild bird
populations have seen success rates above 90% (i.e. more than 90% of the SNPs included
on the array passed the assay design and were polymorphic), with some exceptions (about
75% in house sparrows; Hagen et al. 2013).
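
The probe-count rule described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the function names and the candidate SNP list are hypothetical, and the actual assay design is carried out with the manufacturer's proprietary tools.

```python
# Sketch: count the bead types needed for a candidate SNP list.
# Rule taken from the text: A/T and C/G SNPs need two probes (Infinium
# type I design); all other SNPs need one probe (Infinium type II design).
# Function names and the example SNP list are hypothetical.

TWO_PROBE_SNPS = {frozenset("AT"), frozenset("CG")}


def probes_needed(allele1, allele2):
    """Return the number of probes (bead types) for one biallelic SNP."""
    alleles = frozenset((allele1.upper(), allele2.upper()))
    if len(alleles) != 2 or not alleles <= set("ACGT"):
        raise ValueError(f"not a biallelic SNP: {allele1}/{allele2}")
    return 2 if alleles in TWO_PROBE_SNPS else 1


def total_bead_types(snps):
    """Sum the probes needed over a list of (allele1, allele2) pairs."""
    return sum(probes_needed(a, b) for a, b in snps)


candidates = [("A", "G"), ("C", "T"), ("A", "T"), ("C", "G"), ("G", "T")]
print(total_bead_types(candidates))  # 1 + 1 + 2 + 2 + 1 = 7
```

Because array cost scales with bead types rather than loci, filtering out A/T and C/G SNPs before submission lets more loci fit on an array of a given size.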


2.3 Reduced Representation Techniques or SNP Arrays? 

For genotyping a large number of individuals, custom-designed SNP arrays have been
considered the medium of choice. However, reduced representation sequencing
techniques, where restriction enzymes are used to cut out parts of the genome
which are then sequenced, can also be a very cost-efficient way of obtaining
genotype information (Andrews et al. 2016). This cost advantage is partly offset by
the increased need for bioinformatics skills and resources for proper
post-processing of sequence data, which are more substantial than for the genotype
data one obtains from SNP arrays. In particular, quality of genotype calling is lower
than in SNP arrays and depends critically on the coverage. However, the flexibility
of RAD seq to target varying numbers of loci and levels of coverage means that the
cost per genotype might be lower than with SNP arrays.

Which technique is better suited is also a question of the available genomic 
resources for the species in focus; RAD seq can be used even if no previous genomic 
resources are available, whereas SNP array (preferably) needs information about 
genomic location as well as sequence information around the SNPs. 


3 Statistical Genetic Methods Used to Analyse Avian 
Population Samples 

3.1 The Decline of Linkage Analysis and Rise of GWAS (Genome- 
Wide Association Studies) 

A number of studies have used linkage analyses in natural populations of birds to 
detect QTL of complex traits. For example, Tarka et al. (2010) genotyped 333
individual great reed warblers for 45 autosomal and 6 sex-linked microsatellites/AFLP
markers and located a QTL for wing length on chromosome 2 that explained more
than a third of the phenotypic variance (Fig. 2). A series of studies on a captive
population of zebra finches has also used linkage mapping to detect QTL for beak
morphology (Knief et al. 2012), beak colouration (Schielzeth et al. 2011b) and wing
length (Schielzeth et al. 2011a). 

Linkage mapping requires both a genetic map and a pedigreed population to be
able to determine co-segregation of markers with the phenotype (Slate 2005;
Slate et al. 2010). Thus, in principle a linkage map to establish the position of markers
would be needed for the species in which one aims to do linkage mapping. For
example, a three-generation, captive pedigree was used to establish a linkage map in
zebra finches (Stapley et al. 2008), and a nestbox population with a two-generation,
half sib design was used to construct a linkage map in collared flycatchers (Backström
et al. 2006). However, work in the collared flycatchers has also demonstrated that
synteny (preservation of gene order among different species) with the chicken and
zebra finch genome is high (Backström et al. 2006, 2008a, b), something that
facilitates construction of genetic maps in species without a linkage map.

Fig. 2 The long-term study of great reed warblers from Lake Kvismaren, Sweden, is one of the few
studies that have attempted to replicate a QTL from a linkage analysis with a GWAS in a natural
population. (a) Linkage plot for wing length QTL on chromosome 2 in great reed warblers from
Tarka et al. (2010). (b) Manhattan plot for the same trait from the GWAS using RAD sequencing
data. Adapted from Hansson et al. (2018). (c) A great reed warbler from the long-term study
population at Kvismaren. Photo credit: William Velmala

While linkage analysis has been successfully used in some bird species, it is 
limited by the need for a pedigreed population to trace the co-segregation of marker 
loci and phenotypes (Slate et al. 2010) and in general has low power to detect 
anything but loci of major effect (Santure et al. 2015). Partly for this reason, 
association mapping, which relies on historical recombination events rather than 
recombination events within a pedigree, has become popular. The use of historical
recombination events also allows a QTL to be localised to a narrower region
than when utilising recombination events within a pedigree. The other main reason
for the rise in association mapping studies is no doubt that high-throughput
sequencing costs have declined to the extent that it is feasible to resequence a number of
individuals to generate SNP arrays for large-scale genotyping (Ellegren 2014).

For example, SNP arrays used for association mapping purposes are now
available in a number of species (Table 1), and GWAS have been applied to great tits
(Santure et al. 2015), collared flycatchers (Husby et al. 2015; Kardos et al. 2016), red
grouse (Wenzel et al. 2015) and house sparrows (Silva et al. 2017). Generally, these
studies have found evidence for polygenic architectures of quantitative traits (Kardos 
et al. 2016; Santure et al. 2015). The application of GWAS to wild bird populations 
has allowed finer-scaled studies of genetic architectures than was possible with 
linkage analyses because the identified regions are much smaller in association 
mapping. 
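
To make the idea of association mapping concrete, the following is a minimal sketch of a single-SNP additive test: the phenotype is regressed on allele dosage (0, 1 or 2), and the resulting P value is compared with a Bonferroni-corrected threshold. The data, the function names and the number of SNPs tested are hypothetical. The sketch also uses a normal approximation for the test statistic and ignores relatedness and population structure, which the studies cited above typically account for with mixed models.

```python
from statistics import NormalDist


def snp_association(dosages, phenotypes):
    """Regress phenotype on allele dosage; return (effect size, P value).

    Simplified illustration: ordinary least squares with a normal
    approximation to the test statistic, no correction for relatedness.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sgg = sum((g - mean_g) ** 2 for g in dosages)
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotypes))
    beta = sgy / sgg  # additive effect per copy of the counted allele
    residual_ss = sum(
        (y - mean_y - beta * (g - mean_g)) ** 2
        for g, y in zip(dosages, phenotypes)
    )
    std_error = (residual_ss / (n - 2) / sgg) ** 0.5
    z = beta / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return beta, p_value


# Hypothetical data: ten birds, one SNP, one trait (e.g. wing length in mm).
dosages = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
phenotypes = [9.8, 10.1, 10.9, 11.2, 12.1, 11.8, 10.0, 11.0, 12.0, 10.8]

beta, p_value = snp_association(dosages, phenotypes)
n_snps_tested = 50000  # hypothetical array size
bonferroni_threshold = 0.05 / n_snps_tested
print(beta, p_value, p_value < bonferroni_threshold)
```

In a real GWAS the same test is repeated for every marker, which is why the significance threshold has to be adjusted for the number of tests.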

While many of the early avian studies cited above that have used dense marker 
panels and GWAS have been in pedigreed populations, this need not be the case 
(Speed and Balding 2015). Dense marker panels could be used to infer genomic 
relatedness between individuals within the same generation (i.e. over only one 
breeding season of study) and could lead to more precise, rather than inferred, 
estimates of relatedness (Bérénos et al. 2014).


3.2 Chromosome Partitioning 

In order to assess whether the genetic architecture of traits is polygenic, Yang et al. (2011)
proposed partitioning genetic variance onto chromosomes and testing for a positive
correlation between the size of a chromosome and the amount of genetic variance
that it explained. If particular chromosomes explained more genetic variance than
would be expected based on chromosome size, then this could indicate either a
marker of large effect, or a cluster of markers each contributing a small effect, and
could point to chromosomes, and subsequently regions, that would be interesting for
further inspection (Schielzeth and Husby 2014). In birds, chromosome partitioning
has now been done in great tits (Robinson et al. 2013; Santure et al. 2015), red
grouse (Wenzel et al. 2015), collared flycatchers (Silva et al. 2017) and house
sparrows (Silva et al. 2017). The chromosome partitioning results from these studies
tend to be consistent with a polygenic basis for most traits, but there are large variations in
regression slopes and outlier chromosomes in all these analyses.

At present it is not clear if this variation is biologically interesting or a result of 
variation in sample size and in marker density on the different chromosomes. 

Chromosome partitioning can be a useful tool for traits for which there are no 
clear markers of large effect from a GWAS, because a positive correlation between 
chromosome size and proportion of variance explained would indicate a polygenic 
basis to traits (Kemppainen and Husby 2018a). However, the power of this approach 
depends on factors such as heritability of the trait, the QTL effect size distribution, 
clustering of loci in the genome and sample size (Kemppainen and Husby 2018a), 
and thus its suitability to infer a polygenic basis of traits should be carefully 
evaluated on a per case basis. Also, variation in chromosome size and numbers 
used in the analysis influences power with lower power when there is small variation 
in chromosome sizes (Kemppainen and Husby 2018a). These factors, combined with 
earlier studies using a biased test for inferring polygenic architecture (Kemppainen 
and Husby 2018b), mean that careful consideration and interpretation of
chromosome partitioning results are needed and that future studies should use the correction
introduced by Kemppainen and Husby (2018b). 
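
The basic test behind chromosome partitioning can be sketched as a regression of per-chromosome variance explained on chromosome size. The chromosome sizes and variance proportions below are invented for illustration, and the sketch implements only the naive regression and correlation; it does not include the correction of Kemppainen and Husby (2018b), nor the variance-component estimation that produces the per-chromosome values in the first place.

```python
def slope_and_correlation(x, y):
    """Return the least-squares slope and Pearson correlation of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equally long series with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Hypothetical input: chromosome sizes (Mb) and the proportion of
# phenotypic variance attributed to each chromosome.
chrom_size_mb = [150, 120, 110, 70, 60, 35, 20, 10]
var_explained = [0.060, 0.050, 0.040, 0.030, 0.020, 0.015, 0.010, 0.005]

slope, r = slope_and_correlation(chrom_size_mb, var_explained)
print(f"slope per Mb: {slope:.5f}, correlation: {r:.3f}")
```

A positive slope is what a polygenic architecture would predict; chromosomes lying well above the regression line would be the candidates for closer inspection.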


3.3 Admixture Mapping

Admixture mapping is another genetic technique that can be useful in wild bird 
populations if there is frequent hybridisation and backcrossing (Ottenburghs et al. 
2015). As such it can be a useful tool if there is large-scale population data and 
phenotype data in avian hybrid zones (Ottenburghs et al. 2017). For example, 
Delmore et al. (2016) used hybridising Swainson’s thrushes to dissect the genetic
basis of migratory orientation and plumage colour. By using multiple-SNP Bayesian
models, these authors were able to consider all SNPs together and control for factors
that influence phenotypes and are correlated with genotypes, such as population
structure. Several regions associated with migration and colour were reported, including
TYRP1 (previously known to be associated with plumage colour in quail; Nadeau
et al. 2007) and several genes involved in brain development and circadian
locomotor behaviours (Delmore et al. 2016). Another example of admixture mapping in
birds is a recent study on golden-winged (Vermivora chrysoptera) and blue-winged 
(V. cyanoptera) warblers, which hybridise across a broad zone of eastern North 
America. Toews et al. (2016) assayed SNPs in divergence peaks (i.e. genomic 
regions of high divergence between these two species) in a few hundred individuals 
with known phenotypes that were sampled from both allopatric and sympatric 
geographic regions. One of the most intriguing findings from this study was a perfect 
correlation between the agouti signalling protein ASIP region and throat colouration 
(Toews et al. 2016). This black throat colouration of golden-winged warblers was 
identified as a Mendelian recessive trait already in 1908 (Nichols 1908) and is absent 
in blue-winged warblers and F1 hybrids. Interestingly, associations between ASIP
and recessive melanistic phenotypes have been found in other vertebrates (Hoekstra 
2006). 


4 What Have We Learned About the Genetic Architecture 
and Evolution of Complex Traits from Studies on Birds? 

4.1 Genomic Vs. Pedigree Heritability 

The narrow sense heritability (h²), the proportion of the phenotypic variance
explained by additive genetic variance, is a central parameter in evolutionary biology
because it determines the expected response to selection. Traditionally, this has been
estimated using different breeding designs (parent offspring regression, full sib and
half sib analyses), but during the last 20 years, pedigree data has become the
preferred way to calculate h² (Lynch and Walsh 1998). Collecting pedigree data is
of course a laborious process because all individuals in the population must be 
marked and followed for generations so that relatedness can be inferred. Indeed 
several active long-term studies on birds extend back for 30 or more years (e.g. great 
tits in Hoge Veluwe in the Netherlands and in Wytham Woods in Oxford, UK, 
collared flycatchers, Gotland, Sweden). 
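
The link between heritability and the expected response to selection can be made explicit with the breeder's equation, R = h²S, where S is the selection differential. The short sketch below uses hypothetical variance components and a hypothetical selection differential; the function names are illustrative only.

```python
def narrow_sense_heritability(v_a, v_p):
    """h2 = additive genetic variance / total phenotypic variance."""
    if v_p <= 0 or not 0 <= v_a <= v_p:
        raise ValueError("require 0 <= V_A <= V_P and V_P > 0")
    return v_a / v_p


def expected_response(h2, selection_differential):
    """Breeder's equation: R = h2 * S."""
    return h2 * selection_differential


# Hypothetical values for a morphological trait measured in mm.
h2 = narrow_sense_heritability(v_a=0.4, v_p=1.0)
response = expected_response(h2, selection_differential=0.5)
print(h2, response)  # a heritability of 0.4 and an expected shift of about 0.2 mm
```

With these numbers, parents that exceed the population mean by 0.5 mm are expected to produce offspring that exceed it by about 0.2 mm.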

It is now easy to obtain tens of thousands of markers on many individuals and use
this molecular information to recreate pedigrees or even use estimates of realised 
relatedness (as opposed to expected relatedness) directly to estimate heritability 
(Husby et al. 2015; Santure et al. 2015; Silva et al. 2017; Gienapp et al. 2017). Do 
the possibilities to genotype individuals and assign relatedness based on molecular 
markers mean that long-term pedigreed studies are becoming redundant? If the main 
goal is to estimate evolution in action and/or to study mechanisms of selection in 
detail, the answer to this question is clearly no. Long-term pedigreed populations are 
also useful for estimating heritability because reliable estimates need large sample 
sizes, preferably also of individuals that are distantly related and hence less likely to 
share environment. Moreover, longitudinal pedigreed studies with repeated
sampling of the same individual in different environmental conditions also allow for
studies of genetics of plasticity (e.g. Husby et al. 2011), something that would not be 
possible if sampling just a single time point. 

On the other hand, genotyping populations and using the molecular markers to 
estimate relatedness could improve pedigree estimates if these have been inferred 
from observations (as is the case with many long-term studies on passerines). Using 
marker information to estimate relatedness can also improve heritability estimates 
because of variation in relatedness around expected IBD sharing due to
recombination and Mendelian sampling (Visscher et al. 2008).
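
One common way to obtain realised relatedness from marker data is to build a genomic relationship matrix from centred allele dosages. The sketch below follows the widely used formulation G = ZZ' / (2 Σ p(1 - p)), where Z holds dosages centred on twice the allele frequency. This particular estimator is an assumption chosen for illustration, not necessarily the one used in the studies cited above, and the genotypes are invented.

```python
def genomic_relationship_matrix(genotypes):
    """Build a genomic relationship matrix from allele dosages.

    genotypes: list of individuals, each a list of dosages (0, 1 or 2).
    Allele frequencies are estimated from the sample itself.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n_ind) for j in range(n_snp)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all SNPs are monomorphic")
    # Centre each dosage on its expected value, 2p.
    centred = [[ind[j] - 2 * freqs[j] for j in range(n_snp)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale
         for k in range(n_ind)]
        for i in range(n_ind)
    ]


# Hypothetical genotypes: individuals 0 and 1 are identical at all four SNPs.
genotypes = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [1, 0, 1, 2],
]
grm = genomic_relationship_matrix(genotypes)
for row in grm:
    print([round(value, 3) for value in row])
```

In practice tens of thousands of SNPs would be used, and the resulting matrix would replace the pedigree-derived relationship matrix in the animal model.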

For the above reasons, it is not always clear what approach is best for estimating 
heritability, and the choice in many ways probably comes down to practical factors. 
Also, in cases where the two approaches have been compared (i.e. using pedigree or 
using realised relatedness from marker data), they have been largely concordant 
(Bérénos et al. 2014; Santure et al. 2015). It is important to keep in mind that such
a comparison is valid only when there is strong family structure in the data. For
unrelated individuals, which are typically used in human studies, the heritability
estimate based on realised relatedness is much smaller because it no longer measures
pedigree relatedness, but the proportion of variance captured by the SNP array
(de los Campos et al. 2015). In family data the SNPs capture the long-range LD
found between family members and hence reflect pedigree relatedness rather than
the proportion of variance explained by the SNPs. Thus, what the genomic heritability
measures depends on the pedigree structure in the sample. 


4.2 Many Small or a Few Large Genes? 

A fundamental assumption in quantitative genetics is the infinitesimal model (Fisher 
1930), which postulates that traits are governed by many (infinite) genes of
individually small effect. While this assumption is of course incorrect (after all there is a
finite number of genes in an individual), the infinitesimal model has played, and 
continues to play, an important role in evolutionary theory (Nadeau and Jiggins 
2010; Rockman 2011). 

Linkage mapping studies on birds such as those by Tarka et al. (2010) on great 
reed warblers and the series of studies on a captive population of zebra finches to 
detect QTL for beak morphology (Knief et al. 2012), beak colouration (Schielzeth
et al. 2011b) and wing length (Schielzeth et al. 2011a) found QTL that explained a 
sizeable portion of the phenotypic, and not least genetic, variance. Taken together 
these studies suggested that the genetic architecture consists of at least some genes of 
large effect that can be found even using sparse marker maps and moderate sample 
size. Although the studies acknowledged there might be some inflation of effect sizes 
due to the low number of individuals used (‘Beavis effect’; Beavis et al. 1994), the 
extent of the proportion of variance explained by the QTL in these studies was 
nevertheless a surprise. Not least as work on Drosophila had demonstrated that most 
traits are governed by many genes of individually small effect (Mackay et al. 2009), 
as expected (and assumed) in quantitative genetic models (Falconer and Mackay 
1996). 

The picture emerging from the latest association studies is however a very 
different one because they almost all identify loci of smaller effect, if any significant 
loci at all (Santure et al. 2013, 2015; Husby et al. 2015; Kardos et al. 2016; Silva 
et al. 2017; Lundregan et al. 2018; Hansson et al. 2018). 

This is not to say large effect QTL do not exist in birds; they obviously do 
(e.g. Farrell et al. 2013; Küpper et al. 2016; Mundy et al. 2016; Tuttle et al. 2016).
However, considering the inflation of effect sizes that is common in mapping studies
with the sample sizes of most current avian studies (Slate 2013), it seems prudent to
emphasise that polygenic architectures seem to be the norm. For example, GWAS and chromosome
partitioning have been used to determine a polygenic architecture of eight
morphological traits in two replicated populations of great tits (Santure et al. 2015).
Additionally, a recent study on a sexually selected trait (forehead patch size) in 
collared flycatchers that used a combination of whole genome sequence analyses of 
select individuals with extreme phenotypes and GWAS of a less dense SNP array on 
a larger sample of individuals found no evidence for genes of large effect (Kardos 
et al. 2016), in spite of using whole genome resequencing data. Given the short 
chromosomal distances that LD typically reaches in bird populations (e.g. Kardos 
et al. 2016), the somewhat low densities of markers used in many studies of wild bird 
populations and the possibility for inflated effect sizes (i.e. the Beavis effect; 
e.g. Slate 2013), caution should be advised when proposing oligogenic architectures 
of traits. 
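
The inflation of effect sizes referred to above can be illustrated with a small simulation. The sketch assumes that each study's effect estimate is a draw from a normal sampling distribution around a small true effect and that only estimates passing a significance threshold are reported. All parameter values and the function name are hypothetical, and the sketch simplifies real QTL mapping considerably.

```python
import random


def simulate_significant_estimates(true_effect, std_error, z_threshold,
                                   n_studies, seed=1):
    """Return the absolute effect estimates of 'significant' studies only."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) / std_error > z_threshold:
            significant.append(abs(estimate))
    return significant


true_effect = 0.1  # small true QTL effect (hypothetical units)
std_error = 0.1    # large standard error, as expected with few individuals
significant = simulate_significant_estimates(
    true_effect, std_error, z_threshold=3.0, n_studies=20000
)
mean_reported = sum(significant) / len(significant)
print(len(significant), mean_reported > true_effect)
```

Because an estimate has to exceed three standard errors to be reported, every reported effect in this setting is at least about three times the true effect, which is the essence of the Beavis effect.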


4.3 Insights into Factors That Can Maintain Genetic Variance 

While identifying QTL can inform us about the genetic architecture of traits, an even 
more interesting aspect of detecting QTL is the possibility to examine the
evolutionary forces acting on them (Ellegren and Sheldon 2008; Schielzeth and Husby 2014).
For example, a major unresolved question is how genetic variance can be maintained 
in natural populations given that many traits seem to be under persistent directional 
selection (Kingsolver et al. 2001). Providing further insights into this question 
requires that we know the QTL and have allele-specific fitness estimates because
we can then test evolutionary hypotheses about what can maintain genetic variance, such as
intra-locus sexual conflict, heterozygote advantage (or overdominance in general) or
fluctuating selection. Long-term studies on birds are in a unique situation to answer
these questions because of the detailed fitness data and genotype data that are
available on the individual level.

Fig. 3 An example of a GWAS in collared flycatchers identifying some loci related to variation in
clutch size (Husby et al. 2015). (a) A female collared flycatcher alarms outside a nestbox. Photo
credit: Arild Husby. (b) A collared flycatcher nest from the long-term study population on Öland,
Sweden, where the median clutch size in the population is six eggs. Photo credit: Arild Husby. (c)
Manhattan plot for clutch size with genome-wide significant QTL on chromosome 18 and two
suggestive ones on chromosomes 9 and 26. From Husby et al. (2015). (d) Fine-scale association plot
and patterns of linkage disequilibrium underlying the QTL on chromosome 18

To date, most studies have focussed on detection of QTL rather than testing
evolutionary hypotheses of what can maintain variation at the QTL. One exception
is work on collared flycatchers where Husby and colleagues used an association
mapping approach to detect QTL for clutch size (Husby et al. 2015, Fig. 3). They
identified a genome-wide significant QTL on chromosome 18 (with three SNPs in
high LD) and a suggestive QTL on chromosome 26. Because the individuals in this
study are part of a population of collared flycatchers that is intensively monitored
(Qvarnström et al. 2010), information about lifetime reproductive success is available,
making it possible to test hypotheses for how genetic variation in QTL for
clutch size can be maintained. Interestingly, for the suggestive QTL on chromosome
26, there was evidence of intra-locus sexual conflict where females that were 
homozygous for the A allele at the locus had lower fitness compared to males 
homozygous for the same allele. In females, the lower fitness was a result of the 
same allele having a negative effect on clutch size and therefore number of 
fledglings, but the reasons for increased fitness in males homozygous for the same 
allele are unclear. Males homozygous for the G allele had slightly higher annual 
reproductive success and longer lifespan compared to males homozygous for the A
allele or heterozygotes. While this study would support intra-locus conflict as a
mechanism for maintenance of genetic variance of this clutch size QTL, it is 
important to keep in mind that this was only a suggestive QTL and sample size 
was low (n = 309). More work is needed both to replicate the QTL and to confirm 
the fitness patterns seen in this study. Moreover, there was also no indication of any 
fitness differences among genotypes at the genome-wide significant QTL on chromosome
18, suggesting that evolutionary mechanisms that can maintain genetic
variation are likely different among QTL for the same trait.

As more research groups genotype thousands of individuals from their long-term 
populations on high-density SNP arrays, we will surely start to see not just the 
discovery of more QTL but also tests of evolutionary mechanisms that maintain 
genetic variation. It is worth keeping in mind though that most traits are polygenic, 
and hence both identification of QTL and detecting fitness differences among 
genotypes will be challenging. For example, the large fitness differences found
between horn genotypes in Soay sheep (Johnston et al. 2013) and armour plate
genotypes in three-spined stickleback (Barrett et al. 2008) both involve near-Mendelian
traits governed by a single locus of major effect (RXFP2 and EDA, respectively).


4.4 Replication of QTL 

The considerable effort required to collect the large individual-based samples needed for
linkage mapping or association studies means that most detected QTL have not been
replicated and hence cannot be considered causal. The preferable association
mapping design would be to have a discovery sample and a validation/replication
sample, but for practical reasons this is difficult in avian studies. First, most
researchers only work on a single population of a given species, and second, sample
sizes are often too small to allow subdivision into a discovery and a replication
population.

Nevertheless, several studies have now attempted to replicate association signals 
in wild bird populations. Korsten et al. (2010) used a candidate gene approach to 
examine association between personality and a dopamine receptor gene (DRD4) in 
four different great tit populations, of which one was the population where an initial 
association was found. While the initial association was replicated in the discovery 
population, there was no signal in two of the other populations and only a weak 
association in the third population. One potential reason for this could be
differences in linkage disequilibrium around the associated SNP in the different
populations, but follow-up work has ruled out this explanation (Mueller et al.
2013). At the moment, genetic heterogeneity among populations (i.e. different
genetic variants involved) therefore seems the most likely explanation for
the variable role of DRD4 in individual variation in personality.

A very impressive replication study, also in great tits, was done by Santure and
colleagues (Santure et al. 2015). Using individuals genotyped on a 10K custom SNP
array from a population in the Netherlands and one in the UK, they carried out quantitative
genetic analyses, chromosome partitioning, linkage analyses and association
analyses for both population samples and eight different morphological traits. The
quantitative genetic analyses show that, with the exception of fledgling mass, there
are no significant population differences in additive genetic variance. Surprisingly,
however, regions of the genome associated with traits in one population were not
replicated in the other. Given the very weak genetic differentiation between these
two populations (F_ST = 0.01), this is unexpected, and the authors suggest that the
most likely reason for this lack of replication might be low power to detect causal
variants in either population (i.e. that the detected regions are false positives) or that
different loci contribute to trait variation.

Finally, Knief et al. (2017) attempted to replicate results from earlier linkage 
analyses of a captive population of zebra finches. They genotyped additional 
individuals in the same captive population, from other captive populations as well 
as a wild population of zebra finches for around 700 SNPs in and around linkage
regions previously identified for different morphological traits (Schielzeth et al.
2011a, b; Knief et al. 2012). While associations within the QTL regions were highly
repeatable using an independent set of individuals from the discovery population,
and also partly replicable in other captive populations, they were weak or
non-existent in the wild population. The most likely explanation seems to be, as
above, that the causal variants have not been detected, and thus the association signal
instead comes from markers in LD with a causal variant.

As these studies demonstrate, replication of QTL is nontrivial even in relatively
large-scale studies. The low LD in most bird populations poses serious challenges
not just for initial gene discovery but also for follow-up studies that aim to replicate
associations. Unfortunately, functional work in birds is still difficult to do, but
verifying that the QTL have an effect on the trait using knock-down or knock-out
techniques will be important for the field to move forward in elucidating the functional
role of QTL.


5 Future Prospects in Avian Genomics 

5.1 Development of Novel Statistical Genetic Methods 

The statistical genetic machinery we currently use in avian genomics studies has 
been developed with human studies in mind and as a consequence might not be 
readily applicable to avian genomic studies. One example of this is association 
studies which use linear mixed effects models with the kinship matrix as a random 
effect to control for population stratification (Aulchenko et al. 2007) and a single 
observation per individual. Association studies in natural populations also use linear 
mixed effects models, but they generally include many more random effects to 
control for factors such as yearly variation, spatial variation or the fact that we can 
have repeated measures from the same individuals (Postma and Charmantier 2007). 
Surprisingly, this is not possible in the association mapping software available today.
An exception to this is RepeatABEL, which was developed specifically to handle the
longitudinal data structure common to many avian studies (Rönnegård et al. 2016).
This R package (built on the popular GenABEL R package used in human
association studies; Aulchenko et al. 2007) allows for two random effects (kinship
matrix and repeated measures matrix), which better reflect the data structure of
repeatedly sampled population data and can also result in higher statistical power
to detect genetic variants associated with the trait. Hopefully, in the future, this or
similar software can be modified to also allow other sources of variance to be
modelled explicitly.
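
The two-random-effect structure described above can be written out as a sketch. The notation is generic mixed-model notation chosen here for illustration; it is an assumption about the general form of such a model and is not copied from the RepeatABEL documentation.

```latex
\[
  \mathbf{y} = \mathbf{X}\boldsymbol{\beta}
             + \mathbf{x}_{\mathrm{SNP}}\,\beta_{\mathrm{SNP}}
             + \mathbf{Z}\mathbf{g} + \mathbf{W}\mathbf{p} + \mathbf{e},
\]
\[
  \mathbf{g} \sim N(\mathbf{0}, \sigma_g^{2}\mathbf{K}), \qquad
  \mathbf{p} \sim N(\mathbf{0}, \sigma_p^{2}\mathbf{I}), \qquad
  \mathbf{e} \sim N(\mathbf{0}, \sigma_e^{2}\mathbf{I}).
\]
```

Here y holds all observations (several per individual), X and beta describe the fixed effects, x_SNP is the genotype at the tested marker, g is the polygenic effect with covariance proportional to the kinship matrix K, p is the permanent individual effect that accounts for repeated measures, and Z and W are incidence matrices linking each observation to its individual. Further random effects, such as year or nest site, would enter as additional terms of the same kind.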

Another example of a method developed in human genetics and then applied to 
data on natural populations, including many avian study systems, is chromosome 
partitioning (Yang et al. 2011). This method is explained above, but the principal 
idea is to regress individual chromosome level heritability on chromosome size to 
test for a positive association, indicative of a polygenic trait. While this has been 
applied many times, there has been little evaluation of this method as used on natural 
populations. For example, bird karyotypes are unique in that they have very many 
small micro-chromosomes (<10 Mb), and this makes estimation of the 
chromosome-specific relatedness matrices challenging because of few markers on 
these chromosomes (Silva et al. 2017). In addition, sample sizes from natural
populations are often orders of magnitude lower than in human studies, and it has not
been verified to what extent this influences our ability to draw inferences about a
polygenic basis of traits using this approach.
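
As a rough sketch of the regression step described above, the following Python snippet regresses per-chromosome heritability on chromosome size using ordinary least squares. The chromosome sizes and heritability values are made-up numbers, and the function name `regress_h2_on_size` is hypothetical; estimating the chromosome-specific heritabilities themselves (from chromosome-specific relatedness matrices) is not shown.

```python
def regress_h2_on_size(sizes_mb, h2):
    """Ordinary least-squares fit of chromosome heritability on chromosome size.

    Returns (slope, intercept). A clearly positive slope is the pattern
    taken as indicative of a polygenic trait.
    """
    n = len(sizes_mb)
    mean_x = sum(sizes_mb) / n
    mean_y = sum(h2) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_mb)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_mb, h2))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: seven chromosomes, from macro- to micro-chromosomes
example_sizes_mb = [150, 110, 75, 40, 20, 8, 5]
example_h2 = [0.070, 0.060, 0.035, 0.025, 0.008, 0.006, 0.001]
example_slope, example_intercept = regress_h2_on_size(example_sizes_mb, example_h2)
print(f"slope = {example_slope:.5f} per Mb, intercept = {example_intercept:.4f}")
```

With only a handful of markers on the micro-chromosomes, the heritability estimates for the smallest chromosomes carry large uncertainty, which a simple unweighted fit like this one ignores.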

We are, of course, still in the early stages of applying large-scale genomic data in
avian population studies, but it is clear that we need to take care not to rush into applying
methods from the human genetics field to our own data without careful consideration
of the requirements and assumptions of those models. Future
modifications or development of new software that can better model the complex
data that arise from natural populations are likely to play a key part in the development
and progress of avian genomics.


5.2 Epigenetics 

There is currently great interest in epigenetic mechanisms (structural adaptations of chromosomal
regions so as to register, signal or perpetuate altered activity states; Bird 2007)
in evolutionary biology in general as well as within avian studies specifically
(Derks et al. 2016; Laine et al. 2016). Part of the excitement has
to do with this being a new layer of complexity in gene regulation that we so far
know very little about and that can now be targeted relatively easily using high-throughput
sequencing technology. In particular, both alterations to chromatin
structure and DNA methylation, the two main epigenetic mechanisms, can be
screened using ChIP sequencing and bisulfite sequencing (whole genome or reduced
representation approach), respectively. So far just a few studies have examined DNA
methylation levels in natural bird populations, but this is changing rapidly. One early
example is a study on house sparrows in Kenya that used methylation-sensitive
AFLPs to score DNA methylation and sequence variation. Liebl et al. (2013) found
that variation in DNA methylation levels was higher than genetic diversity and
suggested that variation in DNA methylation could be important adaptive variation,
as house sparrows in Kenya were introduced and therefore have low levels of
genetic diversity yet have managed to spread over much of the country (Liebl
et al. 2013).

We know very little about potential adaptive epigenetic effects in wild bird 
populations, and the epigenetic resources are limited for most species. However, 
the great tit has had its full methylome sequenced recently (Laine et al. 2016). Laine 
and colleagues found that CpG methylation patterns are very similar to those seen in
mammals (i.e. reduced methylation within CpG sites and around transcription start 
sites) and there was also increased CpG methylation in regions inferred to be under 
selection from sequence data (genes related to neuron development and learning). 
This demonstrates a potential role for epigenetic regulation in adaptive evolution 
(see also Verhulst et al. 2016). 

Epigenetic studies on birds will no doubt increase in the near future. However,
as for association studies, a significant challenge will be to demonstrate causality.
This is particularly so for methylation analyses because methylation is almost certain to be
tissue specific as well as time dependent. To our knowledge no study has yet
examined temporal changes in DNA methylation within the same individuals, but
such changes would be expected if methylation regulates seasonal variation in
expression patterns. Some insights into tissue specificity of DNA methylation
have, however, been gained in the great tit for the DRD4 gene, where methylation in
the brain and blood showed a very high correlation at CpG sites (r_p = 0.97, Verhulst
et al. 2016).


5.3 Detecting Signs of Polygenic Adaptation 

It is becoming increasingly clear that most (avian) traits are polygenic (e.g. Robinson
et al. 2013; Santure et al. 2013, 2015; Kardos et al. 2016; Silva et al. 2017), and
hence adaptation is likely to take place through minor changes in allele frequencies
at many loci (Berg and Coop 2014), so-called polygenic adaptation, rather than rapid
allele frequency changes at a single locus or very few loci. Detecting polygenic
adaptation is challenging because statistical methods to detect selection in sequence
data have mainly focussed on identifying hard or soft sweeps where the signal of
allele frequency change is stronger than in polygenic adaptation. In contrast, under
polygenic adaptation, loci might not go to fixation at all or will take a very long time
to do so. Hence, the signal of polygenic adaptation comes from examining the
covariance in allele frequency change among loci of similar effect (Berg and Coop
2014). Statistical methods to test for polygenic adaptation have been developed and
applied to humans (Berg and Coop 2014), but to our knowledge not yet to data from
natural bird populations. One reason for this is no doubt that this method relies
on having effect size estimates of individual loci identified in GWAS to compute a
genetic value that is then tested for local adaptation among replicate populations
(Berg and Coop 2014). However, most GWAS on natural populations are underpowered
to detect genome-wide significant effects and hence will only capture a
(very) small proportion of the genetic variance, something that has a negative effect
on the statistical power to detect polygenic adaptation (Berg and Coop 2014). Other
approaches to detect covarying loci, such as random forest algorithms (Laporte et al.
2016), may be more promising.
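
To make the idea of a GWAS-based genetic value concrete, here is a minimal sketch of how such a value could be computed per population. It assumes purely additive effects and takes the value as twice the sum of effect size times allele frequency across loci, which is our reading of the additive form described by Berg and Coop (2014). The effect sizes, allele frequencies and the function name `genetic_values` are hypothetical. The actual test for polygenic adaptation, which compares among-population variation in these values with a neutral expectation, is not implemented here.

```python
def genetic_values(effect_sizes, allele_freqs_by_pop):
    """Estimated mean genetic value per population: 2 * sum(effect * frequency).

    effect_sizes: per-locus additive effects of the reference allele
    allele_freqs_by_pop: dict mapping population name to a list of
        reference-allele frequencies, one per locus, in the same order
    """
    values = {}
    for pop, freqs in allele_freqs_by_pop.items():
        if len(freqs) != len(effect_sizes):
            raise ValueError(f"{pop}: one allele frequency per locus is required")
        values[pop] = 2.0 * sum(a * p for a, p in zip(effect_sizes, freqs))
    return values


# Hypothetical GWAS effect sizes (per copy of the trait-increasing allele)
example_effects = [0.05, 0.03, 0.04, 0.02]
example_freqs = {
    "pop_1": [0.50, 0.40, 0.60, 0.30],
    "pop_2": [0.55, 0.45, 0.65, 0.35],  # every locus shifted by only 0.05
}
example_values = genetic_values(example_effects, example_freqs)
for pop, value in example_values.items():
    print(pop, round(value, 4))
```

The example shows the point made in the text: no single locus changes much in frequency, yet the small shifts all point in the same direction and add up to a difference in the genetic value between populations.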


5.4 Microevolutionary Trends and Responses to Selection 

As mentioned in the introduction, one of the general goals of evolutionary biology is 
to demonstrate evolution in action, specifically in ecological time scales. This is 
because only studies done at ecological time scales can reveal details about the 
relative importance and interaction between sources of selection, random genetic 
drift and processes such as genomic conflicts in driving evolutionary changes. 
Genomic techniques not only allow better possibilities to investigate these questions
at a finer scale but also facilitate dissection of possible links between evolution
in action (i.e. microevolutionary changes) and large-scale macroevolutionary
patterns. For example, it is possible to investigate whether genes underlying ecological
adaptations are also responsible for the build-up of genetic incompatibilities
between diverging populations and to follow the long-term fate of key complexes of
genes (e.g. with established adaptive functions) in large phylogenetic contexts.

A long-standing interest for avian biologists working on long-term individual- 
based monitoring projects has been to demonstrate that particular traits are 
(a) heritable, (b) under selection and (c) have undergone genetic changes, because 
this would demonstrate microevolution. Demonstrating a response to selection in 
wild populations is not straightforward, and this is especially true for studies aiming 
to investigate responses to ongoing climate change where all three classical 
requirements for detecting microevolution have rarely been met (Gienapp et al. 
2008). Of course, genetic changes can also be due to drift, such that when examining 
genetic trends over time, drift should be modelled explicitly. For example, a study by 
Evans and Gustafsson (2017) found that collared flycatcher forehead patch size has 
evolved to be smaller in response to increased temperatures on Gotland, most likely 
because male flycatchers with small patches survive better over winter in warmer 
years. This evolutionary change was larger than would be expected just due to drift 
(Evans and Gustafsson 2017), but the underlying genomic changes remain 
unknown. 

Fig. 4 Whole genome scans to identify loci controlling (a) forehead patch size variation in collared
flycatchers (from Kardos et al. 2016) and (b) beak size variation between groups of Darwin’s
finches with blunt (G. magnirostris and G. conirostris) versus pointed beaks (G. conirostris and
G. difficilis), from Lamichhaney et al. (2015)

Studying the genomic basis of variation in forehead patch size is complicated by
the highly polygenic nature of this trait (Kardos et al. 2016, Fig. 4), although this is
expected for a condition-dependent sexually selected trait (Rowe and Houle 1996).
By contrast, beak morphology in Darwin’s finches seems to be controlled by genes
with somewhat larger effect sizes (Fig. 4). Let us return to the example of medium
ground finches (Geospiza fortis) living on the island of Daphne Major that experienced
two serious periods of drought outlined in the introduction. A recent study
based on blood samples from individuals collected before and after a drought shows
that a specific gene associated with beak morphology, HMGA2, went through large
changes in allele frequency following the drought episode (Lamichhaney et al.
2016b). Individuals who survived the drought tended to carry a different allele than
those who succumbed. This study meets the first three conditions needed to demonstrate
evolution due to climatic conditions; beak size is a heritable trait (Keller et al.
2001), under selection, with the changes in the phenotype associated with underlying
genetic changes (Lamichhaney et al. 2016b). A model demonstrating that the
changes in HMGA2 allele frequency were not due to drift would further strengthen
the case for adaptive evolutionary change at the gene level.
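
One simple way to think about such a drift model is a neutral simulation. The sketch below is a toy null model only, under strong assumptions: gene copies after the event are treated as a random binomial sample from the frequency before it, and the population size is a made-up number. The function name `drift_p_value` and all parameter values are hypothetical and do not come from the studies cited above; a real analysis would need to reflect the actual sampling scheme, survivorship and pedigree structure.

```python
import random


def drift_p_value(p_before, p_after, n_breeders, n_reps=2000, seed=1):
    """Fraction of neutral simulations whose allele-frequency change is at
    least as large (in absolute value) as the observed change."""
    rng = random.Random(seed)
    n_copies = 2 * n_breeders          # diploid: two gene copies per individual
    observed = abs(p_after - p_before)
    as_extreme = 0
    for _ in range(n_reps):
        # Random binomial sampling of gene copies mimics pure drift
        count = sum(1 for _ in range(n_copies) if rng.random() < p_before)
        if abs(count / n_copies - p_before) >= observed:
            as_extreme += 1
    return as_extreme / n_reps


# Hypothetical numbers: frequency shifts from 0.50 to 0.60 with 100 survivors
print(drift_p_value(p_before=0.50, p_after=0.60, n_breeders=100))
```

A small returned fraction would suggest that the observed change is larger than expected from random sampling alone, which is the kind of evidence the text argues is still needed at the gene level.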

We are not aware of any individual-based studies, aside from the aforementioned 
example, that have passed all of these conditions to demonstrate evolution in 
response to ecological factors, but given the increasing resolution of genomic data 
and larger-scale genotyping and sequencing efforts, it seems likely that we will see 
more demonstrations of adaptive microevolutionary change in the near future. 


6 Summary 

The early pioneering work of Kluijver, Lack, Nice and von Haartman laid the 
foundations for long-term studies on birds and has provided an immensely valuable 
resource for future generations of avian biologists. The longitudinal data on parents 
and their offspring over many generations have made quantitative genetic studies, 
and now more recently genomic studies to map QTL and examine selection at the 
QTL, feasible. During the last 20 years, we have seen a tremendous transformation
in the genetic resources that are available to the avian biologist, something that has
led to the revision of many established ideas about avian mating systems, phylogenetic
relationships and demographic histories. Avian population studies
have been important contributors to this development and provided some of the most
well-known examples of natural selection and evolution in the wild, such as the case
of Darwin’s finches (Grant and Grant 1995; Lamichhaney et al. 2015).

A long history of quantitative genetic studies on birds has clearly demonstrated
a genetic basis to nearly all avian traits and that natural and sexual selection are
ubiquitous (reviewed in Merilä and Sheldon 2001; Postma and Charmantier 2007).
More recently, genomic approaches have confirmed the genetic basis of avian traits 
using marker-derived relatedness matrices in a mixed model framework 
(e.g. Santure et al. 2013, 2015; Husby et al. 2015; Kardos et al. 2016; Silva et al. 
2017). Moreover, some of the major insights provided by recent genomic work on 
birds are that most traits are controlled by many genes of individually small effect 
(Santure et al. 2015; Husby et al. 2015; Silva et al. 2017; Kardos et al. 2016). Despite 
some early indications to the contrary from linkage studies (e.g. Tarka et al. 2010), 
the polygenic nature of most traits will come as no surprise to evolutionary 
geneticists familiar with gene mapping results in lab organisms, such as Drosophila 
(Mackay 2001). This finding has important implications for future gene mapping 
studies in birds because small effect sizes, combined with rapid decay in LD
(Poelstra et al. 2013; Kardos et al. 2016), mean that thousands of individuals and
hundreds of thousands of markers are needed to identify QTL. While such studies
seemed fanciful just a few years back, they are already in progress, and this opens
exciting new possibilities to address how populations adapt to changing environmental
conditions in a more realistic manner, i.e. through polygenic adaptation.


Abstract

The world’s birds are in trouble, and scientific research, including genetic and 
genomic methods, can play an important role in understanding and mitigating 
these problems. In this review, we summarize several ways that the concepts and 
methods of genomics can help with bird conservation and how the dramatically 
increasing power and decreasing costs of these methods may allow an even 
greater role in the future. We assess six primary, not exhaustive, and not mutually
exclusive research areas, including avian forensics, captive management, infectious
disease and vector interactions, metagenomic and microbiome applications,
systematics and the definition of conservation units, and the genomics of adaptation.
We conclude that the uses of genomics to identify, understand, and in some
cases reduce anthropogenic impacts on bird populations are well underway. And 
the future holds great promise that developments in our understanding of avian 
genomes and tools to modify them will play an increasingly important role in 
future attempts to alleviate these impacts.


1 Introduction

Like the proverbial canary in the coal mine, birds serve as important model systems 
and umbrella taxa for conservation of ecosystems. Since 1500 as many as 183 species 
of birds have gone extinct, and currently about 23% of the nearly 11,000 species of 
birds are considered endangered, vulnerable, or threatened, and their decline 
continues at an alarming pace (Birdlife International 2018). While there are many 
scientific fields involved in understanding and mitigating these threats to birds, 
genomics and advanced genetic methods have much to contribute to the conservation
of birds. Here we summarize several of the recent advances in genomic
applications in conservation and provide examples of these approaches from the 
literature. We also provide a prospective view of novel future applications and 
approaches, given the rapidly increasing power arising from high-throughput 
sequencing, synthetic biology technologies, and increased bioinformatics power. 

Genomics in the Broad Sense. Here we consider genomics in the broad sense, that
is, not just sequencing and characterizing full genomes but using genomic and 
advanced (e.g., next-generation) sequencing methods that can provide useful data 
for a wide range of applications in conservation biology. For example, 
transcriptomics methods can be used to identify avian genes expressed in response 
to environmental stressors such as disease, climate change, or pollution (Jax et al. 
2018). In addition, use of metagenomic and microbiome methods can elucidate how 
avian symbionts influence survival, life history, and population dynamics and 
facilitate characterization of diets and dietary shifts (Trevelline et al. 2018). Genomic 
methods can also be used in such classic applications of conservation genetics as 
assays of genetic variation in small and captive populations (Harrisson et al. 2014), 
determination of species and evolutionary significant units (Robertson et al. 2014; 
Oyler-McCance et al. 2015; Peters et al. 2016; Ottenburghs 2019), inbreeding levels 
(Li et al. 2014), movements (Kawakami et al. 2014), population sizes 
(Nadachowska-Brzyska et al. 2015), and the presence of disease (Sun et al. 2015). 
And layered upon these types of issues in conservation genetics, environmental and 
noninvasive DNA approaches have proven useful for research on endangered and 
difficult-to-study birds. Below we survey several of the areas of avian conservation 
biology for which we believe genomic methods can and will play an important role.

Genomics Methods. These can involve assessment of whole genomes or just
sizable numbers of nuclear and plastid loci sampled from across the genome 
(Lemer and Fleischer 2010; Toews et al. 2015). Early applications of genomics 
have primarily involved reduced representation sequencing, in which short 
sequences (e.g., SNPs) are sampled from throughout the genome to represent 
genomic or evolutionary processes occurring across the genome (Huang et al. 
2013; Kraus et al. 2011, 2012). Genomic methods can include whole-genome 
“shotgun” sequencing, transcriptome characterization (usually via RNA-seq 
methods; Wang et al. 2009), and reduced representation methods such as restriction 
site-associated DNA (RADs, Baird et al. 2008), ultra-conserved elements (UCEs, 
McCormack et al. 2012), exomes, and other sorts of capture approaches. They also 
often include metagenomic (Riesenfeld et al. 2004) and metabarcoding (amplicon) 
methods (Taberlet et al. 2012) such as those used in some eDNA (environmental
DNA or DNA sampled from an environmental substrate rather than a single organism)
and microbiome assessments.


2 Applications of Genomics to Avian Conservation 

2.1 Forensics 

One of the seemingly simplest uses of genetic and genomic data is in avian forensic 
analysis. Forensics involves using tools to make identifications that are relevant to 
the solution of a public issue (such as identification of a bird involved in an airplane 
bird strike) or a crime (such as poaching or illegal trade). The classical approaches 
involve species identifications using comparisons of feather morphology or anatomy,
but more recently simple DNA barcoding has become standard (Dove et al.
2008, 2009; Johnson 2011). In addition, microsatellite or simple tandem repeat
(STR) markers, which usually require higher-quality DNA samples, have been the
standard for identifying individuals within a species (e.g., Dawnay et al. 2009;
Bielikova et al. 2010; Jan and Fumagalli 2016; Coetzer et al. 2016) and population
of origin (Weissensteiner and Suh 2019). Usually these methods are used in identification
cases where individual birds have been the focus of poaching or theft. More
sophisticated DNA analyses have now increased resolution of population- and
individual-level identifications (Iyenegar 2014; Arenas et al. 2017), and next-generation
sequencing and genomic methods hold great promise for making more
accurate and inexpensive identifications.

In addition, the use of genomic methods in forensics includes identification of 
species and of individuals from remnant and often degraded samples, which are not 
usually considered “genome quality” and thus are amenable to only some next-generation
sequencing methods (Yang et al. 2014). In the past, simple PCR assays of
mtDNA and microsatellites have been the primary approaches in avian forensics
(e.g., Dove et al. 2008; Coetzer et al. 2017), but next-generation methods may have a
greater ability to contribute to forensic analyses. This is because very small
fragments of DNA are all that are needed via shotgun or hybridization capture
methods to characterize a SNP, and many dozens to thousands of SNPs can be
assayed in a single analysis, assuming adequate coverage to call each SNP.

Recently, next-generation sequencing methods have been used to identify larger 
numbers of avian microsatellites than can be obtained from standard methods, for 
example, for use in parentage analyses of Bornean birds (Kaiser et al. 2015) and 
forensics of potentially criminally obtained endangered New World parrots (Jan and 
Fumagalli 2016). In this process, many thousands or even millions of random 
sequences are obtained from shotgun sequencing of one or more individuals on a 
high-throughput sequencer. Sequencing platforms must be capable of producing
sequences of sufficient length to include sequence repeats capable of generating 
variation (>10 dinucleotide to pentanucleotide repeats), plus adequate priming sites 
(usually >150 bp in total length). From this approach, many hundreds or thousands 
of loci can be identified that can then be developed and optimized for further use. 
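
The screening step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the cited studies: the function name `find_microsatellites`, the regular expression and the `min_flank` parameter (a stand-in for "adequate priming sites") are assumptions made here, and the thresholds simply mirror the rough figures given in the text (about ten or more repeats of a 2 to 5 bp motif, reads of at least 150 bp).

```python
import re


def find_microsatellites(read, min_repeats=10, min_read_len=150, min_flank=20):
    """Return (motif, repeat_count, start) for tandem repeats in one read.

    Only di- to pentanucleotide motifs are reported, and only when the read
    is long enough and leaves flanking sequence on both sides for primers.
    """
    if len(read) < min_read_len:
        return []
    # Shortest motif of 2-5 bases followed by at least (min_repeats - 1) copies
    pattern = re.compile(r"([ACGT]{2,5}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(read.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAA...
        if match.start() < min_flank or len(read) - match.end() < min_flank:
            continue  # not enough flanking sequence for priming sites
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((motif, repeat_count, match.start()))
    return hits


# Toy read: non-repetitive flanks around a (AC)12 repeat
toy_flank = "TGCAACGT" * 9
toy_read = toy_flank + "AC" * 12 + toy_flank
print(find_microsatellites(toy_read))
```

Candidate loci found this way would still need primer design and testing for polymorphism before they are useful as markers.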

Another interesting application has been to use PCR with mammalian markers on 
avian specimens to assess the identity of predators of birds such as sage grouse from 
mammalian saliva or other remnants left behind (Hopken et al. 2016). Microbiomes
may also be useful for identifying population of origin or perhaps may even
have signatures useful to identify individuals (Arenas et al. 2017).

Perhaps the biggest problem with forensics as applied to birds is more in terms of 
policy, in that for criminal investigations specific protocols and constrained chains of 
custody are required, and development, troubleshooting, and application of 
novel methods tend to evolve very slowly in these communities because of 
legal constraints. 


2.2 Captive Population Management 

Genomic techniques are ideally suited to advance the goals of captive population 
management. Captive population managers seek to preserve the range of genetic 
diversity of a species in order to preserve adaptive variations and adaptive potential. 
Managers also try to avoid adaptation to captivity or other forms of selection which 
might influence the allele frequencies of the population. Genomic data can be 
applied to increase the accuracy and precision of current methods for maintaining 
overall genetic diversity and has the potential to enable new methods/practices to 
avoid adaptation to captivity and preserve adaptive variants. 

2.2.1 Maintaining Genetic Diversity 

One of the main goals of captive population management in zoos and other 
institutions is to enable conservation in situ, by providing animals for reintroduction 
or for release into declining wild populations (Ralls and Ballou 2013). Managers 
may also maintain viable captive populations for display/public education or 
research (Ralls and Ballou 2013). Typically, captive management includes managing
both the demographic and genetic profile of the population as a whole to preserve
diversity and ensure that the population will grow or retain its size. The goal of 
maintaining a stable or growing captive population is generally met through the use 
of pedigree information and planned pairings (Ballou and Lacy 1995). Each
wild-caught founding member of the captive population is assumed to be unrelated. The
relationships (or kinship) between each animal and all the others in the population 
are determined, and a breeding plan is devised to minimize the average mean kinship 
within the population. This has been shown to be the optimal strategy to preserve the 
original genetic diversity of the founders (Ballou and Lacy 1995). Care is also taken 
to manage the demographic profile of the population to ensure that there will be 
enough breeders of both sexes in the future to maintain the population. 
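
As an illustration of the calculation behind this strategy, the sketch below computes pedigree kinship with the standard recursive definition and then averages it per individual. The pedigree, the individual names, and the function names are hypothetical, and founders are assumed unrelated and non-inbred, exactly as the text describes; it is a toy example rather than a replacement for dedicated studbook software.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (sire, dam). Founders have (None, None)
# and are assumed unrelated, as in the mean-kinship method.
PEDIGREE = {
    "F1": (None, None), "F2": (None, None), "F3": (None, None),
    "A": ("F1", "F2"), "B": ("F1", "F2"), "C": ("F3", "A"),
}
ORDER = list(PEDIGREE)  # parents must be listed before their offspring


@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that alleles drawn at random from i and j are identical by descent."""
    if i is None or j is None:
        return 0.0
    if i == j:
        sire, dam = PEDIGREE[i]
        return 0.5 * (1.0 + kinship(sire, dam))
    # Recurse through the parents of whichever individual was born later.
    if ORDER.index(i) < ORDER.index(j):
        i, j = j, i
    sire, dam = PEDIGREE[i]
    return 0.5 * (kinship(sire, j) + kinship(dam, j))


def mean_kinship(living):
    """Average kinship of each living animal to the whole living population (self included)."""
    return {i: sum(kinship(i, j) for j in living) / len(living) for i in living}
```

Animals with the lowest mean kinship carry the least-represented founder alleles, so a breeding plan built on this idea prioritizes them for pairing.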

However, most captive populations do not meet the assumption of unrelated 
founders. For example, multiple individuals from the same clutch may inadvertently 
be collected, and multiple members of the same natural population are often selected 
as founders without knowing their relatedness (Bergner et al. 2014). Genetic data 
can be used to identify related individuals in some cases, but this is more difficult in 
small populations that have been subject to multiple generations of inbreeding. 
Historically, genetic markers either provided too little resolution to resolve
relationships (e.g., allozymes) or, because of low genetic variability, had to be
developed de novo in large numbers for each species (microsatellites).
Genome-wide data allow kinship to be calculated among larger numbers of individuals,
and among more closely related individuals, and can be used to resolve or revise
pedigrees (Bergner et al. 2014). Genome-wide sequencing in California condors, for
example, has resulted in revised breeding strategies by revealing unknown relatedness
structures between founders (Romanov et al. 2009; Ryder et al. 2016). Similarly,
the ability to easily generate large numbers of microsatellite markers using
next-generation sequencing methods (as described above) will also allow greater
resolution of pedigrees and relatedness within captive populations.
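
When no reliable pedigree exists, relatedness can instead be estimated directly from marker data. The sketch below shows one common estimator, a genomic relationship matrix built from centred allele counts; choosing this estimator is an assumption made here for illustration, since the chapter does not name a specific method. Allele frequencies are estimated from the sample itself, which is a simplification that biases estimates in small or inbred populations.

```python
def genomic_relationship(genotypes):
    """Genomic relationship matrix from allele counts.

    genotypes: one list per individual, each holding 0/1/2 copies of the
    reference allele at every locus (hypothetical input layout).
    """
    n, m = len(genotypes), len(genotypes[0])
    # Sample allele frequency at each locus.
    freqs = [sum(ind[k] for ind in genotypes) / (2 * n) for k in range(m)]
    # Scaling term: expected variance summed over loci.
    scale = 2 * sum(p * (1 - p) for p in freqs)
    # Centre each genotype on its expected value, 2p.
    centred = [[ind[k] - 2 * freqs[k] for k in range(m)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / scale for cj in centred]
        for ci in centred
    ]
```

Pairs with high values in the resulting matrix are candidates for close relatives among putative founders, which is the kind of signal that can prompt a revised pedigree.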

For captive populations where collection of new founders is still possible, the 
ability to compare genomic diversity in captive and wild populations is invaluable 
(e.g., Rascha et al. 2016). These data would help conservation managers preserve as 
much wild genetic diversity as possible (Fig. 1, top right) by allowing them to target 
underrepresented genetic profiles for collection or translocation (Mounce et al. 
2015). Even if no captive genomic data exist, studies of wild populations that 
identify genomic diversity or population structure can greatly benefit collection 
efforts (Mounce et al. 2015). Additionally, captive managers benefit from studies 
of wild population genomics as they learn about potential subspecies or other groups 
that should possibly be managed separately. 

Birds that live and breed in groups, like flamingoes, also present a problem for
traditional methods, as individual parents cannot be tracked to create accurate
pedigrees. Parentage assignments of individuals can be done with traditional genetics
techniques, but determining the relatedness between individuals in a given group or
between groups has been more challenging (Smith 2010). Large numbers of
genome-wide markers could be used to improve group management by allowing
comparison of individuals or subgroups without needing to reconstruct a
complicated pedigree using traditional genetic methods.

Fig. 1 Conceptual figure depicting a few ways in which genomics contributes to avian
conservation. Clockwise from top right: (a) Individuals (rows) from a given species can be
evaluated for similarity to a target sample at loci across the genome (more blue = higher
similarity, more orange = low similarity, gray = no data) in forensic analysis [Section 1].
(b) Kinship matrix for individuals within a species, where green shows low relatedness
(i.e., potential breeding pairs) and purple represents high relatedness (i.e., relatives or
other individuals to avoid breeding in captive propagation) [Section 2]. (c) Differential
expression of immune function genes can be analyzed in resistant and susceptible
individuals to determine which genes control infection (greater than 1 = over-expression,
less than 1 = under-expression) [Section 3]. (d) The distribution of bacterial OTUs in
decaying chicken over time; each polygon represents an OTU, where early time points
(0-2 days) indicate shortly after death and later time points are indicative of carrion
[Section 4]. (e) Three species of variably threatened Hawaiian honeycreepers, the Hawai’i
‘amakihi (Chlorodrepanis virens, green), ‘i’iwi (Drepanis coccinea, curved bill), and
‘apapane (Himatione sanguinea, red with black bill). (f) Clustering of individuals (rows)
across genomic loci (columns) can inform scientists of the loci involved in species
divergence and reveal cryptic species [Section 5]. (g) Outlier analyses can reveal loci
showing significantly higher divergence (blue circles) than the rest of the genome,
indicating potential adaptation to local climatic regimes or other conditions [Section 6]



2.2.2 Adaptive Variation 

As more is understood about adaptive variation, genomic data could potentially be 
used to help preserve adaptive diversity beyond the mean-kinship method. Selection
based on traits in conservation breeding has potential pitfalls: (1) we do
not know which traits will be adaptive when the captive-bred animals are released,
(2) interactions between multiple loci are not well understood for traits of interest,
and (3) selection for some traits could reduce other genomic variation. Genomic
studies will allow deeper investigation of some of these concerns. 

With the advent of the use of genomics in evaluating adaptive variation, captive 
population managers will have to develop methods for integrating that information 
into management decisions. A first step in this direction, which is already being 
undertaken, is understanding adaptation to captivity. This occurs when selection 
pressures are unintentionally applied in the captive environment and result in genetic 
shifts in the population which may then be maladaptive upon release of individuals 
to the wild. Currently, adaptation to captivity has been identified in several species 
(notably fish in aquaculture; Makinen et al. 2015), but this work remains to be done 
in birds. Learning which genomic features are susceptible to selection in the 
captive environment will help managers reduce those effects while maintaining 
overall genetic diversity. 

2.2.3 Future Directions 

Using genomic tools for captive management is still an emerging practice, and many 
challenges have yet to be addressed. A few avian species, such as California condors 
and whooping cranes, are already being managed using data from genomic studies 
(Ryder et al. 2016). These methods are becoming more accessible to the conservation
community, and several groups are developing consistent workflows for
marker development and genotyping. However, the following questions still need
to be addressed: (1) How many markers are needed to resolve pedigrees and answer
kinship questions? (2) If some of the pedigree is known, how can an optimal set of 
individuals be chosen for genotyping? (3) In which cases is it more efficient to use 
genomic data to reconstruct a pedigree vs. calculate genomic kinships? As genomic 
tools are applied to the captive management of more species, answers to these 
questions should emerge. 


2.3 Avian Infectious Disease 

Genomic tools have enhanced our ability to understand avian disease dynamics, 
coevolution between avian hosts and pathogens, and the evolution of pathogen 
virulence and host resistance and tolerance. These technologies have enabled 
scientists to push the boundaries of knowledge in infectious disease research. 
Although many of the major questions in avian disease ecology and evolution 
remain the same, our ability to answer them in more nuanced ways has improved. 
Biologists seek to explain patterns of host susceptibility, pathogen virulence, parasite 
specificity, coevolution, and parasite transmission dynamics across complex
landscapes and host communities. Prior to the genomics era, studies of avian disease
focused on single “candidate” host genes thought to play a role in resisting
infection (e.g., Jarvi et al. 2004), on single parasite genes posited to confer virulence,
on parasite morphology or single genes to describe taxonomy
(Krizanauskiene et al. 2006), and on phenotypic disease pathology in the host.

These approaches have led to significant advances in our understanding of host 
and pathogen ecology and evolution; however, each poses barriers to a complete 
understanding of host-pathogen dynamics. Prior to the advent of genomic 
technologies, avian disease biology experienced several primary limitations:
(1) variation in virulence among parasite strains was cryptic; (2) disease pathology results
from the interaction between parasite virulence and host susceptibility and thus
cannot be used to infer exclusively host or pathogen processes; (3) alleles conferring
both pathogen virulence and host resistance/tolerance were usually unknown; (4) a
lack of variation in commonly sequenced genes hindered comparative study; and
(5) individual species contributions to disease transmission were difficult to quantify.
With genomic tools, avian biologists are on the cusp of making large
breakthroughs in these and other areas. Here, we outline a few outstanding questions 
in avian disease biology and highlight some key studies that have capitalized on 
genomic approaches. 

2.3.1 Host Susceptibility and Pathogen Virulence 
Host Susceptibility 

Single-gene studies, primarily focused on major histocompatibility complex (MHC;
Weissensteiner and Suh 2019) genes and other genes known a priori to be part of
normal immune system processes, have resulted in slow discovery of novel disease- 
related genes. In turn, this has biased our understanding of disease response such that 
non-MHC genes have largely been ignored. Negative results have likely gone 
unpublished, leading to an artificial inflation of the generality of importance of 
very few genes. Until recently, scientists had little ability to discover novel genes
involved in the infection process, as this work was economically inefficient. The
single-gene approach is still of great use in systems where particular genes are 
known to be involved in the infection process (Chapman et al. 2016). However, 
there is growing evidence that a large number of host genes may be under selection 
from pathogens (Cheng et al. 2015; Wang et al. 2016). Of these, only some are 
conserved across populations (Connell et al. 2012; Videvall et al. 2015;
Cassin-Sackett et al. 2018); furthermore, genetic variation at these loci can be shaped by
both selection (Raven et al. 2017) and drift (Gonzalez-Quevedo et al. 2015). Thus, a 
priori hypotheses of relevant genes will resolve only a limited picture of the 
evolutionary response to infection. Moreover, with the global increase in wildlife 
diseases, many of which are introduced, pathogen invasion of novel hosts and host 
response to novel pathogens may involve genes that co-opt cellular machinery normally
used for other purposes. A classic example of such a solution to disease is the sickling of
red blood cells that prevents replication of Plasmodium falciparum in humans. The gene
causing this abnormal hemoglobin likely would have been overlooked with a
target-gene approach. Recent studies have assessed the potential for using reciprocal
translocations as a means to reintroduce genetic diversity at disease resistance 
genes into wild populations (Grueber et al. 2017); this type of effort could be 
aided by a more complete understanding of which disease-related genes play a 
role in each system (Fig. 1, bottom right). 

Pathogen Virulence 

Traditionally, pathogen virulence in bacteria has been assumed to arise as a result of 
virulence genes contained on plasmids. This is often the case, but genomic tools 
have allowed us to understand that virulence is often conferred by multiple genes 
acting in tandem, not only on plasmids but also localized on “pathogenicity islands” 
(Pilo et al. 2005) or elsewhere in the genome. Often, virulence genes control some 
other cellular function such as metabolism and lead to virulence as a by-product (Pilo 
et al. 2005; Szczepanek et al. 2010; Tulman et al. 2012). Comparative genomics has 
been used in studies of commensal and pathogenic bacterial strains to identify 
unique genes as putative causes of virulence. For example, genomic tools have 
enabled scientists to subtract the genomes of nonpathogenic strains of Escherichia 
coli from the genomes of avian pathogenic strains, leading to the confirmation of 
existing virulence plasmids as well as identifying novel virulence factors 
(Schouler et al. 2004). The subtractive genomic approach can also be used to 
subtract pathogen genomes from host genomes in a single host tissue. 
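
The logic of genomic subtraction can be expressed as a set operation on gene content. The sketch below is an in-silico analogue only: Schouler et al. (2004) is cited in the text for the result, not for this procedure, and the strain labels, gene identifiers, and the function name `candidate_virulence_genes` are all hypothetical. It assumes that genes have already been annotated and clustered into comparable identifiers across strains.

```python
def candidate_virulence_genes(pathogenic, commensal):
    """Genes present in every pathogenic strain and absent from every commensal strain.

    Both arguments map a strain label to the set of gene identifiers found in it.
    """
    # Genes shared by all pathogenic strains.
    core_pathogenic = set.intersection(*(set(genes) for genes in pathogenic.values()))
    # Any gene seen in at least one commensal strain.
    commensal_pool = set().union(*commensal.values())
    # What remains after subtraction is a list of candidates, not confirmed factors.
    return core_pathogenic - commensal_pool
```

The output is best read as a shortlist for functional follow-up, since a gene can be unique to pathogenic strains without contributing to virulence.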

The identification of virulence factors in non-bacterial pathogens is even more 
complex. Mechanisms of host red blood cell invasion, for instance, are not 
conserved across Plasmodium species: diversification of particular host invasion 
mechanisms may have occurred only in mammal-infecting Plasmodium lineages and 
thus lend limited insight into invasion of avian blood cells by P. gallinaceum 
(Lauron et al. 2015) or P. relictum. Candidate gene studies of Plasmodium infection 
have typically focused on a small number of putative virulence genes such as
merozoite surface protein 1 (msp1), which acts in one of the first steps in host cell invasion
(Hellgren et al. 2013). However, transcriptome sequencing of P. gallinaceum 
illuminated additional parasite genes necessary for avian infection and revealed 
evidence of diversifying selection as a result of the host immune response 
(Lauron et al. 2014). As in host resistance/tolerance, parasite virulence may likewise 
be conferred by multiple genes, especially in systems where a coevolutionary arms 
race is occurring or where multiple host species exist that differ in susceptibility. For 
instance, attenuation of a Mexican West Nile virus lineage was conferred by both
pre-membrane and envelope proteins, while either gene alone did not confer
attenuation (Langevin et al. 2011).

In addition to characterization of genes involved in virulence, genomics
technologies facilitate pathogen discovery when the cause of wildlife disease outbreaks is
unknown. Some diseases have unknown etiologic agents; for example, a panviral microarray
led to the discovery of an avian bornavirus as the previously uncharacterized causative
agent of proventricular dilatation disease and mortality events in both the
United States and Israel (Kistler et al. 2008). Genomic tools hold similar promise
for pathogen discovery (Borner and Burmester 2017).

2.3.2 Host Specificity and Coevolution

Accurate description of host specificity relies on correct delineation of host and 
parasite species. Historically, parasite taxonomy was studied using parasite morphology,
parasite life cycle, host species, or host pathology. More recently, mtDNA
haplotypes have been used to describe taxonomic relationships. Both of these
approaches have yielded great insights into pathogen specificity and coevolutionary
dynamics, yet they have often led to spurious conclusions resulting from incomplete
information. For instance, malaria parasites (including Plasmodium, Leucocytozoon,
and Haemoproteus species) were traditionally classified based on their morphological
characteristics and host species, leading to inferences of high host specificity.
However, Beadell et al. (2006) and Martinsen et al. (2006) demonstrated that 
morphospecies are not always linked to particular genotypes, suggesting potentially 
widespread cryptic diversity or morphological convergence among haemosporidian 
species. The use of mitochondrial genes modified our understanding of parasite 
taxonomy and demonstrated wide variation in host specificity (Beadell et al. 2006, 
2009) in this group, as dozens of cryptic lineages were discovered (e.g., Palinauskas 
et al. 2015; Nilsson et al. 2016). As more host species have been sampled and similar 
lineages detected across taxa, we have come to understand that many blood parasites 
display high host generality (Iezhova et al. 2011; Nilsson et al. 2016). Nonetheless,
haemosporidians display greater levels of host specificity than other pathogens 
such as West Nile virus, in which lineages correspond to geography rather than 
host species (May et al. 2011). 

In addition to their range of host specificity, pathogens also exhibit varying 
degrees of vector specificity that was largely undetectable prior to the genetics era. 
Martinsen et al. (2008) were among the first to demonstrate parasite-vector specificity:
they found that major clades of malaria parasites were associated with shifts into
new vectors, which then permitted weaker associations with major vertebrate host 
clades. While some parasites are capable of infecting multiple vector families, others 
may be specific to particular genotypes. For instance, the existing population of 
introduced Culex quinquefasciatus in Hawaii (a mix of North American and mostly 
Austral-Pacific lineages; Fonseca et al. 2006) successfully transmits the local strain 
of Plasmodium relictum, GRW4, to local avian species, whereas the North American
lineage of Cx. quinquefasciatus was experimentally shown to be refractory to GRW4 
(Fonseca, Fleischer, unpublished). The identification of these two strains was 
facilitated by genetic studies, and genome-wide scans of the distinct vector lineages 
would lend insight into the functional differences between genotypes. Wilkinson 
et al. (2014) used a combined 16S and targeted gene approach to reveal that 
pathogenic bacteria display vector specificity in seabird ticks [although this may 
be the case only for certain tick symbionts, as Duron et al. (2016) found minimal 
evidence of host specificity and a high degree of horizontal transfer]; this combined 
approach adds power to avian disease systems for which genomic resources have yet 
to be developed. Indeed, next-generation sequencing technologies can be used to 
develop Sanger sequencing assays (Hellgren et al. 2013). 

In complement to pathogen vector specificity, host preference of vectors has been 
illuminated by bloodmeal analysis (Kilpatrick et al. 2006; Savage et al. 2007),
reinforcing the idea that hosts are not chosen from a community randomly (Pauli
et al. 2012). Genomics offers the potential for addressing specificity at various levels 
using high-throughput analysis of bloodmeals in both hosts and vectors. 

Cryptic variation in avian host species (Stervander et al. 2016) could lead to 
erroneous conclusions such as overestimating host specificity (e.g., one parasite 
lineage in two cryptic host species) or the diversity of parasites infecting host species 
(e.g., two parasite lineages in two cryptic host species). In addition, host-switching
events would not be detected when parasite lineages infected a new, cryptic host 
taxon. In the case of multiple parasite lineages infecting cryptic host species, an 
incomplete understanding of host taxonomy could beget the false inference of stable 
parasite species coexistence when in fact two parasite lineages competitively 
exclude each other in different host species (specialization on cryptic host taxa). 
Using genomic data to inform avian and parasite taxonomy solves many of these 
problems; in fact, whole-genome information resolves many taxonomic uncertainties
(Jarvis 2014; Hug et al. 2016; Ottenburghs 2019) and should enable a better
understanding of host specificity, host and vector preference, and parasite diversity. 

With the ever-increasing amount of genomic data available on parasite lineages, 
more variation among strains will be detected, posing the problem of how to classify 
parasite species. Historically, species were delineated based on the existence of one 
or a few nucleotide differences in mitochondrial genes, but it is now trivial to
discover dozens of novel variants within a single parasite species, which complicates
delineation based on molecular information alone. Thus, biologists studying
parasites will have to find common ground on which to define and delineate taxa
(Outlaw and Ricklefs 2014) to facilitate comparative studies of host specificity and
coevolution. It is now possible to identify specific parasite genes that are
differentially expressed in different host species (Videvall et al. 2017), allowing for a
functional understanding of host specificity that may contribute to our understanding 
of parasite taxonomy. 

2.3.3 Phylogeography, Population Genetics, and Within-Host 
Evolution 

After some limitations of mtDNA for resolving phylogeographic patterns became 
evident (e.g., the introgression of entire mitochondrial genomes (Bellemain et al. 
2008) can lead to discordance between mtDNA and species trees), scientists started 
leveraging virulence genes or host-specific parasite attack genes to describe
host-parasite relationships, host-switching events, parasite phylogeny, and parasite
phylogeography (Taubenberger et al. 2005; Hellgren et al. 2013, 2014; Harkins and Stone
2015). These single-gene studies have begun to clarify some phylogenetic relationships
but have obscured others (Harkins and Stone 2015).

In contrast to drawbacks with morphological approaches, single-gene studies 
confront the problem of poor detection of mixed infections (Valkiunas et al. 
2006). The cost efficiency of multi-gene Sanger sequencing is declining, particularly 
in systems where suitable genes for taxonomy, phylogeography, or functional 
ecology and evolution have not yet been identified. However, single-gene studies 
can overlook functional diversity at loci not typically used for taxonomy. These
drawbacks have the potential to be resolved with parasite genomics (Nilsson et al.
2016). Genomic tools can reveal cryptic diversity that can be leveraged for fine-scale 
studies of parasite phylogeography and within-host evolution. 

Introduced pathogens can be traced using genomics to identify the origin of 
introduction. For instance, whole-genome sequencing of West Nile virus strains in 
the western Mediterranean revealed that the present diversity stems from a single 
introduction followed by local maintenance in the region (Sotelo et al. 2011). This 
study also revealed a meaningful insight about pathogen evolution: a single mutation 
linked to virulence (Brault et al. 2007) occurred multiple times in distinct geographic 
regions during the evolutionary history of West Nile virus (Sotelo et al. 2011). The 
ability to trace pathogen introduction history both to a particular host species and 
geographical location enables quarantine or vector control measures to be put in 
place to conserve vulnerable avian species. For example, Usutu virus strains from 
various locations in Africa were subjected to whole-genome sequencing, and the 
country of origin of European introduced Usutu virus was inferred due to high 
similarity between a native strain from Senegal and introduced strains in Europe 
(Nikolay et al. 2013). 
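
The simplest form of the comparison described above is pairwise identity between an introduced strain and candidate source strains. The sketch below assumes the genomes have already been aligned to equal length; the function names and the labels used for reference strains are hypothetical. Published analyses such as those cited rely on phylogenetic inference rather than raw identity, so this is a deliberately reduced illustration of the idea.

```python
def identity(a, b):
    """Proportion of identical sites between two aligned sequences, ignoring gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in compared) / len(compared)


def closest_source(query, references):
    """Label of the reference strain most similar to the introduced (query) strain.

    references: mapping of strain label -> aligned sequence.
    """
    return max(references, key=lambda label: identity(query, references[label]))
```

High identity to one reference is suggestive rather than conclusive, because an unsampled population could be more similar still.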

2.3.4 Insights from Genomics 

In summary, many broad evolutionary patterns have been revealed by the availability
of genomic data, including several major insights in avian disease. Table 1
presents a selection of several major themes in host and parasite biology for which 
our understanding has changed as a result of genomic advances. 

2.3.5 Limitations 

One area in avian disease research that still lags behind, as in other fields in avian 
biology, is verification of gene function in non-model organisms. Experimental 
infection studies are often impractical or unethical in non-model species, particularly 
those of conservation concern. As a result, most of the data on genes underlying 
pathogen and host phenotypes are correlational. Comparative genomics with well- 
annotated genomes, in conjunction with experimental challenges of widespread 
related species, will help address this shortcoming. In addition, improvements in gene
prediction, genome annotation, and characterization of noncoding DNA will facilitate
prediction of how genetic changes will interface with different genomic
backgrounds. These improvements can be made not only via experimental gene
knockout studies and germ cell editing to assess gene function (Park et al. 2014) but
also with a better understanding of genome evolution. Synteny is higher in avian
genomes than in other vertebrate taxa (Zhang et al. 2014), enabling these types of 
advances more readily in birds. 

A limitation that is biological and not technical emerges from recent findings in
genomics studies that resistance is conferred by multiple, often unpredictable loci
that vary across populations (Connell et al. 2012; Videvall et al. 2015;
Cassin-Sackett et al. 2018). This suggests a continual need for species-specific studies of
host and parasite evolution. However, this shortcoming is ameliorated by genomics,
which enables whole-genome and transcriptome data to be gathered from increasingly
smaller biological samples. As whole-genome data are acquired with increasing
efficiency and decreasing cost, this limitation is expected to decline.

Table 1 Summary of applications of genomics to major themes in the study of avian disease

| Topic | Previous understanding | Current understanding | Relevant contributions |
| --- | --- | --- | --- |
| Parasite taxonomy | Morphology- or cytochrome b-based | A large degree of cryptic diversity exists, some of which may have functional consequences | Beadell et al. (2006), Martinsen et al. (2006), Palinauskas et al. (2015) and Nilsson et al. (2016) |
| Host specificity | Most parasites are host-specific | There is wide variation in the degree of host specificity; some parasites exhibit vector rather than host specificity | Fonseca et al. (2006), Martinsen et al. (2008), Iezhova et al. (2011), May et al. (2011), Wilkinson et al. (2014), Nilsson et al. (2016), Duron et al. (2016) and Videvall et al. (2017) |
| Mechanisms of host resistance/tolerance | Primarily MHC-based | Many additional immune genes are involved (and some genes not in the immune pathway) | Cheng et al. (2015), Wang et al. (2016) and Cassin-Sackett et al. (2018) |
| Evolution of MHC | Diversifying selection | Diversifying selection, balancing selection, neutral evolution | Grueber et al. (2013), Raven et al. (2017) and Gonzalez-Quevedo et al. (2015) |
| Pathogen virulence | Decreases over evolutionary time | Increases, decreases, or remains stable over evolutionary time | Szczepanek et al. (2010), Tulman et al. (2012), Murray et al. (2017a, b) and Fan et al. (2017) |

Although genomic technology is changing rapidly, it remains difficult to isolate 
parasite DNA from host tissue due to the lower cellular representation of parasite 
DNA. Nonetheless, some promising approaches have proven successful in recent 
parasite studies: Plasmodium relictum was isolated from blood smears using lasers 
and subsequently subjected to whole-genome sequencing (Lutz et al. 2016). This 
technology offers the potential to acquire genomic information from archived blood 
smears and other valuable sources. 

2.3.6 Future Directions 

Future genomics work on parasites and vectors of avian disease is needed to
characterize the ecological and evolutionary linkages in vector-borne diseases.
These studies will illuminate the mechanisms underlying vector adaptation to
parasite lineages as well as parasite influence on vector behavior (e.g., how
Plasmodium relictum induces a feeding preference of uninfected mosquitoes for
infected birds; Cornet et al. 2013) and therefore disease dynamics.

The unprecedented ability to obtain whole-genome sequence data rapidly and 
with increasing cost efficiency paves the way for effective conservation actions to 
take place in near real time. In the coming years, whole-genome sequencing or
reduced representation screening at disease resistance loci can be carried out prior to
management actions such as captive breeding or relocation, effectively increasing 
the disease resistance or tolerance of vulnerable host populations. 

Exciting opportunities for conservation have emerged with the discovery of 
CRISPR (clustered regularly interspaced short palindromic repeats) defense systems 
in bacteria and archaea and the associated advent of genome-editing technologies 
(Jinek et al. 2012). With CRISPR-associated (Cas) systems, genetic information can 
be permanently edited, which can be leveraged in disease systems by modifying 
pathogen virulence, vector competence, or host tolerance. Indeed, disease systems 
have been some of the first real-world applications of this tool (Kistler et al. 2015; 
Hammond et al. 2016). Moreover, the technology allows for multiple loci to be 
edited simultaneously (Jao et al. 2013), which would simplify genome editing in the 
case of multilocus tolerance or virulence. Recent developments facilitate CRISPR- 
Cas gene editing in non-model avian species (Cooper et al. 2017, 2018). As 
pathogens are increasingly moved around the globe and naive host populations are 
exposed to novel pathogens, CRISPR could allow for emergency management of 
critically endangered populations when there is not sufficient time for captive 
breeding or other conservation strategies. 


2.4 Avian Conservation Genomics and Heterogeneous Samples: 
Metagenomics and Metabarcoding 

In the traditional sense, genomics has involved the discrete analysis of one or a few 
samples, usually collected from individuals with known taxonomy or other 
characteristics of particular interest. Recently, however, there has been increasing 
interest in the simultaneous analysis of mixed samples that represent the “community”
(Xu 2006). These heterogeneous community samples can include, for example,
feces, microbiota, or mixtures of the remains of several individuals/species. High- 
throughput sequencing of these samples provides the power and sequencing depth 
required to obtain genomic data from those individuals, without the necessity of 
time-consuming approaches such as molecular cloning to first separate them by 
species or genotype. These data can be used for a wide variety of purposes in avian 
conservation, which can include (but are not limited to) biodiversity detection, 
investigating bird diets (which may limit aspects of the annual cycle such as breeding 
or survival during migration), detecting endangered birds as dietary items of 
predators, and understanding how avian microbiota contribute toward adaptation 
and health (see below). 

In the strictest definition, metagenomics involves shotgun (i.e., theoretically 
random) sequencing of DNA—potentially to the scale of whole genomes—directly 
from a heterogeneous sample (Taberlet et al. 2012). This approach can be used to 
characterize the (often microbial) community in terms of taxa present and investigate 
the content and function of the genes sequenced. Similar approaches analyzing RNA 
instead of DNA can investigate further questions such as how gene expression 
changes in interacting communities under different conditions or states (Fierer
et al. 2012; Poretsky et al. 2009). In a related technique, often termed metabarcoding,
DNA from an environmental sample is first PCR amplified for a particular region 
(e.g., 16S ribosomal RNA or COI, the cytochrome oxidase I gene), and then these 
amplicons are sequenced on a high-throughput platform, usually with the aim of 
deeply characterizing the community composition (Taberlet et al. 2012). 

Below we describe some recent and potential applications of these two general
approaches in birds, outline some strengths and limitations of each, and discuss
possible future directions for the field.

2.4.1 Metabarcoding 

Perhaps one of the greatest potential areas of impact for metabarcoding in avian 
conservation is through biodiversity monitoring and discovery. Theoretically, any 
sample mixture could be screened to identify the presence of avian species. This type 
of work is routinely taking place for mammals and aquatic animals (e.g., Bohmann 
et al. 2014), but so far has been less often applied to birds. One current application is 
“bulk-bone” metabarcoding of partial, undiagnosable bones from archeological and 
paleontological sites (Murray et al. 2013; Honka et al. 2018). This work allows 
identification of the presence of previously unreported bird species through time and 
may be particularly beneficial in documenting changes in diversity of small-bodied 
birds whose bones are easily fragmented and therefore often unrecognized. Thus, 
metabarcoding may help determine a baseline for conservation actions, establish
the former presence of currently extinct or extirpated species, and identify the
former range of species for possible reintroduction. Another potential application 
could arise from metabarcoding remains after bird collisions. In cases where collisions
with aircraft result in samples lacking sufficient feather material to allow
morphological analyses, or in cases where mixed species flocks were involved in 
the strike, metabarcoding may provide important information on the avian taxa 
involved in collisions, thus allowing implementation of more effective management 
strategies (Dove et al. 2008). Metabarcoding could also be implemented to study 
collisions with turbines and wind energy technologies, where multiple birds may be 
struck through time and where the individuals struck did not die within the immediate
vicinity and therefore cannot be identified through visual monitoring.
The approach could also be applicable to medicines and foods derived from 
illegal harvests of birds, where DNA is present from multiple species or too 
degraded for traditional barcoding (Staats et al. 2016). 

Another area for which metabarcoding demonstrates great potential to contribute 
to avian ecology and conservation is through analysis of diets. A better understanding
of diets could provide key insights, for example, about interspecies competitive
interactions and predator-prey dynamics, which would in turn inform our understanding
of factors limiting reproductive success or survival. Ornithologists have
long struggled to characterize diets by making foraging observations and analyzing 
gut contents or regurgitation. Metabarcoding can potentially yield vast quantities of 
data on these questions in a relatively short amount of time. Such analyses, however, 
have only recently been successfully applied to birds (Vo and Jedlicka 2014), likely 
due to the high acidity of bird feces, which can make obtaining DNA a challenge.
Jedlicka et al. (2017) used a metabarcoding approach to characterize the diets of
western bluebirds (Sialia mexicana) in vineyards and found that they consumed 
primarily herbivorous insect taxa, rather than the predator or parasitoid taxa that may 
also be contributing to pest control. Also, McClenaghan et al. (2019) investigated the
diet of a declining avian insectivore, the barn swallow (Hirundo rustica). They found
that this species is a broad-scale generalist and is able to feed nestlings a varied diet 
and take advantage of opportunistic food resources. Thus, this species should 
theoretically be resilient to changes in food availability. Other studies have 
investigated the use of aquatic insect prey sensitive to pollution by Louisiana 
waterthrush (Parkesia motacilla) (Trevelline et al. 2016), as well as prey usage by
semipalmated sandpipers (Calidris pusilla) during migratory stopover (Gerwing
et al. 2016). Alternatively, diet metabarcoding could help identify when birds are 
the prey items, for example, in the diets of felines, mustelids, rodents, and other 
invasive predators (Zarzoso-Lacoste et al. 2016). 

Metabarcoding has also been used to understand the bacterial communities of 
birds and impacts on their physiology and health (Waite and Taylor 2015). For 
example, hoatzins (Opisthocomus hoazin) consume leaves, and fermentation is
thought to take place in their enlarged crops, reminiscent of cattle and horses rather 
than other birds. A comparison of hoatzin and cow foregut and hindgut bacterial taxa 
suggested that indeed, the foregut taxa of hoatzin and cow were more similar to each 
other than to their own hindgut community (Godoy-Vitorino et al. 2012). This
suggests that the bacterial community of the hoatzin crop may represent a convergent
evolutionary adaptation for dealing with a herbivorous diet. In contrast to this,
the critically endangered kakapo (Strigops habroptilus), which also consumes plant
material, but not entire leaves, has a very different foregut bacterial community and 
therefore is unlikely to perform fermentation and has adapted to its diet in different 
ways (Waite et al. 2013, 2014). Similarly, black (Coragyps atratus) and turkey
vultures (Cathartes aura), which scavenge decaying carcasses rife with toxic bacterial
compounds, host a unique gut microbiome, which likely originates from their
diet (Roggenbuck et al. 2014). Additional analyses of the avian microbiome could 
also play a role in understanding, for example, the bacterial communities which may 
be critical for coloration and degradation of feathers, mate attraction, nestling 
development, reproductive investment, as well as migratory ability (Jacob et al. 
2014, 2015). All of these aspects influence survival and reproduction and therefore 
would be particularly important to understand for endangered avian taxa. 

2.4.2 Metagenomics 

While implementation of metabarcoding approaches is becoming more common, 
few metagenomics studies have involved birds, particularly wild birds. One such 
recent paper examined the cecal metagenome of the greater sage grouse 
(Centrocercus urophasianus; Kohl et al. 2016). The greater sage grouse regularly
(and at some points in the annual cycle exclusively) consumes sagebrush (Artemisia
sp.), a plant that contains toxic secondary compounds. Kohl et al. (2016) sequenced
total DNA from the cecal microbiota and assembled gene sequences (rather than full 
genomes). They found that the sage grouse cecal metagenome was enriched for
genes related to breakdown of these compounds, as compared to the microbiota of
chicken and mammalian herbivores. This suggests that greater sage grouse and their 
microbiota may be specially adapted for a diet of sagebrush and other plants with 
similar toxic secondary compounds. 

Metagenomics could be particularly useful for understanding pathogens and 
disease in wild birds. Disease is an important factor limiting population growth 
rates and carrying capacity; therefore, gaining a better understanding of the 
infections that birds carry and how they spread could have important management 
and conservation implications. Beyond this, wild birds may serve as the reservoir for 
diseases that impact humans (e.g., avian influenza and West Nile virus), so a better 
understanding of bird diseases would benefit people as well (Kapgate et al. 2015). In 
these cases, having a more complete, less-biased picture of the bacteria, viruses, and 
fungi present (and potentially accurate representations of their abundance) would be 
especially important. A metagenomics approach was recently taken to investigate 
the virus community in domestic turkeys (Day et al. 2010), and a similar approach 
could be taken for monitoring and surveillance of wild and endangered birds. 

2.4.3 Strengths and Challenges 

While the above examples provide insights into the current and potential future 
applications of metagenomics and metabarcoding studies, it is useful to have an 
understanding of the challenges these studies face. 

In any metabarcoding study, one of the early steps involves PCR amplification 
with theoretically universal primers. Unfortunately no primer is truly universal, and 
this step may introduce taxonomic biases during amplification (Deagle et al. 2014), 
which can be difficult to predict ahead of time. These biases include not only 
amplification success of some species and failure for others but also biases in the 
strength of amplification with some species amplifying strongly and others 
amplifying weakly. This makes interpretation of metabarcoding sequence abundance
challenging (Elbrecht and Leese 2015). Beyond this, sequences obtained
from metabarcoding are often compared to available DNA barcode sequence 
databases, such as the Barcode of Life (Ratnasingham and Hebert 2007). Taxonomic 
biases in the database can lead to biased sequence identification or identification to 
higher taxonomic levels only (such as family or order; Kvist 2013). A related 
problem is that DNA barcoding loci may in some cases lack sufficient resolution 
(particularly for degraded samples which make use of shorter barcode loci), so that
two or more taxa may share the same barcode sequence and cannot be
identified uniquely to the species level. This is a common issue for bird diet items
such as plants and may exist for arthropods and some higher vertebrates as well 
(Starr et al. 2009; Janzen et al. 2005). In general, data on presence of taxa may be 
more reliable than data on absence or abundance of taxa. 
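
The resolution problem described above can be made concrete with a small sketch. When a read's barcode matches reference sequences from more than one taxon equally well, a conservative assignment reports only the most specific rank that those taxa share. The rank list, the placeholder lineages, and the function name `lowest_shared_rank` are hypothetical and chosen for illustration. This is not the algorithm of any particular pipeline or reference database.

```python
# Hypothetical rank order and placeholder reference lineages (illustration only).
RANKS = ["order", "family", "genus", "species"]

LINEAGES = {
    "Genus_a species_1": ["Order_x", "Family_y", "Genus_a", "Genus_a species_1"],
    "Genus_a species_2": ["Order_x", "Family_y", "Genus_a", "Genus_a species_2"],
    "Genus_b species_3": ["Order_x", "Family_y", "Genus_b", "Genus_b species_3"],
}


def lowest_shared_rank(matched_taxa, lineages=LINEAGES, ranks=RANKS):
    """Return (rank, name) for the most specific rank shared by all matched taxa.

    Returns (None, None) if there are no matches or nothing is shared.
    """
    shared = (None, None)
    if not matched_taxa:
        return shared
    for index, rank in enumerate(ranks):
        names_at_rank = {lineages[taxon][index] for taxon in matched_taxa}
        if len(names_at_rank) != 1:
            break  # the matched taxa diverge here; keep the last shared rank
        shared = (rank, names_at_rank.pop())
    return shared


if __name__ == "__main__":
    # A unique match resolves to species; ambiguous matches fall back to
    # genus or family, depending on where the matched lineages diverge.
    print(lowest_shared_rank(["Genus_a species_1"]))
    print(lowest_shared_rank(["Genus_a species_1", "Genus_a species_2"]))
    print(lowest_shared_rank(["Genus_a species_1", "Genus_b species_3"]))
```

Under this rule, a barcode shared by two congeners is reported at genus level, which illustrates why presence data at coarse ranks may be more dependable than species-level identifications.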

The benefits of the metabarcoding approach, however, offset the challenges. If 
characterizing biodiversity is the goal, then this approach targets sequencing to 
limited areas of the genome that are comparatively well covered in online databases 
(e.g., the COI gene, Ratnasingham and Hebert 2007). Also, by targeting the 
sequencing effort, many samples can be combined within a single sequencing run,
and smaller instruments with a cheaper total run cost may be sufficient (e.g., the
Illumina MiSeq), making this work feasible for projects with limited budgets or very 
large sample sizes. Beyond this, because the protocols generally start with a PCR 
amplification step, it becomes possible to append overhanging sequences to the 
primers that match sequencing adapters and thus more quickly and efficiently 
conduct library preparation for sequencing, though these overhangs may also 
cause taxonomic biases during amplification (Berry et al. 2011). 

One of the benefits of the metagenomics approach is that it removes the initial 
PCR step that generates much amplification bias. Because sequencing may yield 
multiple and/or longer regions, taxonomic resolution may be increased with this
approach, which may therefore result in a more comprehensive characterization of the
community of interest (Srivathsan et al. 2015). However, it should be noted that few
extensive tests of metagenomics sequencing for community characterization have 
been conducted and these approaches may still suffer from technical problems, such 
as the presence of inhibitors or biased ligation of sequencing adapters. Another 
strength of the metagenomics approach, as mentioned above, is that when sequencing
is no longer targeted to one or a few genes, a wider range of questions may be
addressed, leading to deeper insights about microbial community interactions, gene 
expression, and gene function (Fierer et al. 2012; Poretsky et al. 2009). 

Metagenomics studies also face several challenges. Because sequencing is not 
targeted, more sequencing is necessary, and therefore run costs are likely to be 
higher. In addition to this, datasets are likely to be less complete, as compared to a 
traditional genomic study, because rather than sequencing a single individual, 
hundreds or thousands of individuals/species may be sequenced at the same time. 
This also leads to bioinformatic and computational challenges of assembling reads 
from many individuals into useful datasets (Wooley et al. 2010). 

2.4.4 Future Directions 

There are several areas where advances in metagenomics and metabarcoding could 
make a significant impact on our knowledge and understanding of avian conservation.
Gaining a better understanding of the factors that influence the accuracy of
the relationship between abundance in the sample and the proportion of sequencing 
reads would yield many new insights. For example, the inclusion of accurate 
abundance information into analyses of food web dynamics would indicate not 
only that a prey taxon was consumed but its relative importance in the diet. 
Similarly, for biodiversity monitoring, abundance of a species’ sequences could 
indicate the relative numbers of individuals living in different areas or the relative 
abundance of pathogenic microorganisms compared to commensal or beneficial 
taxa. For metabarcoding, many factors may influence abundance, including but not 
limited to storage and DNA extraction methods, primer bias, polymerase bias, and 
even bioinformatics approaches for quality control and taxonomic assignment 
(Vo and Jedlicka 2014; Deagle et al. 2013, 2014; Kopylova et al. 2016; 
Krehenwinkel et al. 2017; Nichols et al. 2018). Metagenomic studies may suffer
less from biased sequence abundance, but little is known about this. Potential confounding
factors could include copy number issues, biased digestion, and biased databases for
taxonomic comparison and functional annotation. Despite these potential confounding
factors, several studies are beginning to suggest that a relationship may
exist between abundance and sequence proportions (e.g., Srivathsan et al. 2015; 
Evans et al. 2016, and others). Further, for metagenomic studies, improvements in 
sequencing (number, quality, and length of reads) and reduction in cost would be 
valuable, but perhaps the most beneficial would be continued improvements in our 
computational abilities, both with regard to the raw computing power and software 
to handle complex datasets. 
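
To illustrate why primer bias alone can decouple read proportions from true abundance in metabarcoding, the sketch below uses the textbook exponential model of PCR, in which each template is multiplied by (1 + efficiency) per cycle. The model is a deliberate simplification (no plateau phase, no extraction or sequencing bias). The efficiency values and the function name `expected_read_proportions` are assumptions made for illustration, not measurements from any cited study.

```python
def expected_read_proportions(template_copies, efficiencies, cycles):
    """Expected proportion of amplicons per taxon after PCR.

    template_copies: dict taxon -> starting number of template molecules
    efficiencies:    dict taxon -> per-cycle efficiency between 0 and 1
    cycles:          number of PCR cycles
    Simplified model: copies after PCR = start * (1 + efficiency) ** cycles.
    """
    amplified = {
        taxon: template_copies[taxon] * (1.0 + efficiencies[taxon]) ** cycles
        for taxon in template_copies
    }
    total = sum(amplified.values())
    return {taxon: copies / total for taxon, copies in amplified.items()}


if __name__ == "__main__":
    # Two hypothetical prey taxa, equally abundant in the sample.
    templates = {"taxon_a": 100, "taxon_b": 100}
    # Assumed efficiencies: the primers fit taxon_a slightly better.
    efficiency = {"taxon_a": 0.95, "taxon_b": 0.85}

    for n_cycles in (0, 15, 30):
        proportions = expected_read_proportions(templates, efficiency, n_cycles)
        print(n_cycles, {t: round(p, 3) for t, p in proportions.items()})
```

In this toy setting, a modest difference in per-cycle efficiency compounds over 30 cycles, so that two equally abundant taxa end up with strongly unequal shares of the amplicon pool. This is one reason why read proportions are usually interpreted cautiously.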


2.5 Systematics and Species Limits 

To conserve species, scientists and managers need to know which entities are distinct
species, that is, how data inform us of species status and limits based on
accepted species definitions. In addition, an understanding of hierarchical levels of
genetic structure below the species level is necessary to optimally conserve
genetic variation. Thus we also need to have clear-cut concepts and definitions of
such entities as subspecies, evolutionarily significant units (ESU), and the 
Endangered Species Act (ESA) legally defined “distinct population segments” 
(DPS). While there are problems with each of these categories because of differences 
in the definitions that have been proposed (Frankham et al. 2012; Garnett and
Christidis 2017; Haig et al. 2006; Funk et al. 2012) and difficulties obtaining relevant
data (such as hybrid fitness and local adaptation), genetic, and now
genomic, data have proven useful to elucidate these categories for policy and
management decisions. Note, the chapter “Avian Species Concepts in the Light of 
Genomics” in this volume is devoted to species concepts and how genomic data can 
be incorporated into criteria to distinguish species, so our discussion here will largely 
be concerned with their applications to conservation. 

2.5.1 Species Definitions and Limits 

Species definitions and limits can be tested with genomic data to determine species boundaries and
relationships and amounts and directions of introgression at contact zones 
(Ottenburghs et al. 2017). The expansion of data to the genomic level has greatly 
increased our ability to determine these variables (Toews et al. 2016) to better test 
species limits. The ability to discover chromosomally local “islands of divergence” 
between taxa that are otherwise very closely related genetically can reveal the 
presence of barriers to unrestricted gene flow, the directionality and timing of 
introgression, and the roles of sexual selection and local adaptation in speciation. 
In addition, these data are also very useful for assessment of different subspecies, 
ESU and DPS definitions, and could also provide information on the role of adaptive 
divergence that might limit a taxon or populations to a particular range or habitat. 

Although biologists have defined a plethora of different species concepts 
(Frankham et al. 2012; Garnett and Christidis 2017), one in particular, the Biological 
Species Concept (BSC, Mayr 1942), has largely dominated the systematics of birds. 
More recently, a set of determined ornithologists have advocated for the application
of the Phylogenetic Species Concept (PSC, Eldredge and Cracraft 1980), which
would potentially double the number of described avian species to over 20,000 
(Barrowclough et al. 2016). Conservation biologists argue that species definitions 
and designations need standardization and better control (Mace 2004; Frankham 
et al. 2012), even proposing an official body of scientists be established to carry out 
this role (e.g., Garnett and Christidis 2017). 

For both of ornithology’s primary species concepts and others that could be 
applied (Hill 2017; Kraus et al. 2012), genomic approaches would provide greater 
resolution of species limits. Assessment of fine-grained genome-wide sequence 
variation across avian contact zones would provide information on divergence 
(Fig. 1, middle left) but also on the level and pattern of introgression across the 
zone and into the parental taxa (e.g., Poelstra et al. 2014; Baldassare et al. 2014; 
Toews et al. 2016; Kearns et al. 2017). For example, only six small genomic regions 
were differentiated between the morphologically and mitochondrially (Gill 1997; 
Shapiro et al. 2004) distinct blue-winged (Vermivora cyanoptera) and golden-winged
(Vermivora chrysoptera) warblers (Toews et al. 2016), and in proximity to
four of those regions were genes potentially involved in the plumage differences 
between the two taxa. Based on the genomic sequence analyses, these two “species” 
have had a long and complicated history of interaction and, despite this interaction 
and their high genomic similarity, are likely best viewed as distinct species. There 
are a number of other cases emerging for birds in which apparently
reproductively isolated taxa show minor, but detectable, overall genomic-level
differentiation (e.g., differentiated subspecies in the barn swallow Hirundo rustica,
Safran et al. 2016) or only a few islands of differentiation (as in the above warbler
case), and these will likely necessitate modification in the criteria used for species 
and subspecies designations. 

2.5.2 ESU and DPS Definitions 

Other definitions of importance in conservation management are what have been 
defined as evolutionarily significant units (ESU) and distinct population segments 
(DPS). The latter is actually incorporated as a defined unit of conservation for
vertebrates only under the US Endangered Species Act (see Pennock and Dimmick 1997
and follow-ups for a discussion of DPS and how employing ESU alternatives could 
muddy the waters). 

For a DPS to be defined and listed, the population under consideration must be 
discrete and significant and have an appropriate level of threat or endangerment. 
Although little guidance on these three criteria was provided in the act itself,
additional legislation and discussion have indicated that the discrete population 
segment should “differ markedly from other populations of the species in its 
genetic characteristics” and genetics has often played a role in these designations. 

There have been many definitions of ESU proposed since it was first discussed in 
a meeting review by Ryder (1986), including ones that involved mitochondrial DNA 
reciprocal monophyly and significant allele frequency differences (Moritz 1994) and 
identification of loci (and/or morphological or ecological traits) that reflect adaptation
to local environments (Waples 1995; Crandall et al. 2000).
While genetic markers have been used in ESU and DPS approaches to a great
extent (see examples in Fleischer 1998; Phillimore and Owens 2006), only recently 
have genomic methods been applied to confirm subspecies or ESU designations. A 
nice example of genomic differentiation and even potential local adaptation among 
subspecies and populations is that of barn swallows in Eurasia and North America 
(Safran et al. 2016; Scordato et al. 2017). These studies show clear genomic-level 
divergences based on PCA and structure analyses, with varying levels of
hybridization or gene flow among the taxa based on hybrid analyses. Isolation by 
distance versus isolation by adaptation analyses suggest a strong component of the 
divergence is due to the latter, but only on a macrogeographic level. Another recent 
study showed only minor genomic-level differentiation using RAD-seq markers 
(and morphology) between Japanese and Hawaiian populations of the black-footed 
albatross (Dierickx et al. 2015), although the authors could not exclude the possibility
that the slight differentiation detected might be meaningful for adaptation and
thus conservation management. In most cases, genomic markers enable a much
greater degree of resolution of taxon- or population-level differentiation than analyses
based on mtDNA or nuclear sequences or microsatellites, but we must caution
that the greater resolution of genomic differences attained by large numbers of 
markers may not necessarily imply biological significance. 
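
Differentiation of this kind is often summarized with the fixation index F_ST. The sketch below computes a basic Wright-style F_ST for two equally weighted populations from biallelic allele frequencies, combining loci as a ratio of sums. It ignores sample-size corrections and is not necessarily the estimator used in the studies cited above. The function names and the example frequencies are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_components(p1, p2):
    """Return (H_T - H_S, H_T) for one locus and two equally weighted populations."""
    p_mean = (p1 + p2) / 2.0
    h_total = heterozygosity(p_mean)                              # H_T: pooled
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0    # H_S: mean within
    return h_total - h_within, h_total


def multilocus_fst(freqs_pop1, freqs_pop2):
    """Ratio-of-sums F_ST across loci; None if no locus is polymorphic."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        num, den = fst_components(p1, p2)
        numerator += num
        denominator += den
    if denominator == 0.0:
        return None
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical allele frequencies at five SNPs in two populations.
    pop1 = [0.52, 0.31, 0.75, 0.10, 0.48]
    pop2 = [0.47, 0.35, 0.70, 0.14, 0.55]
    print(multilocus_fst(pop1, pop2))
```

With thousands of markers, even very small values of such an index can be estimated precisely, which is the statistical side of the caution above: a detectable difference is not automatically a biologically meaningful one.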


2.5.3 Future Directions 

Fitting structured and legal definitions like species, ESU, and DPS to biological data 
can be difficult, and genomic data may not really add to our ability to elucidate these 
concepts. Perhaps the key point of utility for avian conservation is really to understand
how populations or species are related to each other, how they may be locally
adapted, and what are the current and historical levels of gene flow among them. 
Population genomics is enhancing our ability to quantify these aspects and would 
allow us to determine how these entities would need to be managed. It is now 
possible to genotype hundreds of individuals at thousands of markers without 
needing a reference genome (although many more reference genomes are becoming 
available; Callicrate et al. 2014; Jarvis et al. 2014; Zhang 2015). Such a plethora of 
information enables us to explore how genome structure and evolution contribute to
differentiation and speciation. We are now able to assess functional and adaptive 
differences between populations using genomic and transcriptomic data (e.g., Safran 
et al. 2016; Taylor and Mason 2015). Genome-wide markers such as single nucleotide
polymorphisms (SNPs) can be used to genotype the same loci in ancient and
modern samples, and although working with ancient DNA presents many challenges,
genomics helps us overcome some of the problems with fragmented DNA.
Thus, ultimately we will be able to document the historical contexts (past population
size, structure, migration levels) using approaches such as Pairwise Sequentially
Markovian Coalescent (PSMC; Li and Durbin 2011) and similar analyses for
present-day situations (e.g., Nadachowska-Brzyska et al. 2015, Murray et al. 
2017a, b). These can provide information useful for the management and recovery 
of endangered avian species.


2.6 Adaptation to Climate Change and Other Stressors

Adaptation occurs in response to various biotic and abiotic factors. As anthropogenic 
stressors increase, studying species response has become essential. Understanding 
how populations react to climate, habitat loss, and other compounding stressors can 
help us conserve species as a whole. In the past, studying adaptation required 
previous knowledge of genes involved in specific pathways. Early work focused 
on identifying candidate genes responsible for the change in phenotype through 
methods such as quantitative trait locus (QTL) analyses. With the advent of
next-generation sequencing, our capacity to study adaptation in response to anthropogenic
stressors has drastically improved. The field has expanded to include
multiple pathways of adaptation, such as transcriptional regulation, selection in 
exonic regions, and posttranscriptional regulation. Next-generation sequencing 
allows us, without any a priori knowledge, to identify loci under selection that 
may be important in adaptation (Fig. 1, top left; Stillman and Armstrong 2015). 
Additionally, it enables us to study responses within a single generation (e.g., 
phenotypic plasticity) using methods such as RNA-seq (Wang et al. 2009; 
avian transcriptomics review: Jax et al. 2018). 


2.6.1 Species Response to Climate Change 

When faced with a stressor, species respond in one of four ways: (1) evade, (2) die 
(local extinction), (3) acclimate, or (4) adapt. While it is more straightforward to test 
for evasion or local extinction, the latter two have been historically more difficult to 
tease apart (Gienapp et al. 2008). Climate change-related range shifts entail an 
overall shift toward higher latitudes and higher elevations (Chen et al. 2011). 
Range shifts are considered to have two components—the cool-edge expansion 
(evasion) and the warm-edge contraction (local extinction; Wiens 2016). Range 
expansion and local extinction have been a primary focus in the response to climate 
change, whereas the importance of evolution occurring at the warm edge is often 
overlooked (Hoffmann and Sgro 2011; Vedder et al. 2013). 

Population response in the warm edges of a species’ range can occur through 
adaptation and acclimation or genotypic specialization and phenotypic plasticity 
(Box 1). A well-documented example of population response to climate change is 
the occurrence of earlier mean breeding dates in many bird taxa (Charmantier and 
Gienapp 2013). Breeding timing is often based on prey abundance, which is in turn 
dependent on climatic factors. In great tits (Parus major), breeding season depends 
on a brief annual peak in caterpillar abundance, their offspring’s food source, which 
is determined by spring temperatures (Visser et al. 2004). Vedder et al. (2013) found 
evidence of plastic response in breeding time in Wytham Woods great tits. Based on 
these data, the authors modelled extinction risk and found that the absence of this 
phenotypic plasticity would increase the likelihood of population extinction by 
500-fold. They, however, note that little is known about the underlying causes and 
limits of this plasticity. The usage of next-generation sequencing methods could help 
identify the genomic basis of this response.

Fig. 2 Representation of reaction norms of three genotypes for a single trait.
Genotypes A and B show discontinuous and continuous phenotypic plasticity,
respectively. Genotype C shows no plasticity for this trait since its phenotype
remains consistent across environments



Box 1: Genotypic Specialization Vs. Phenotypic Plasticity 

Genotypic specialization is the process by which canalized, locally adapted 
phenotypes are present in an environment, and phenotypic specialization 
(or plasticity) is differentiation by regulated gene expression in response to 
the given environment (Wahl 2002). Genotypic specialization involves the 
genetic makeup of a population changing over generations due to the differential
survival or reproductive success of certain genotypes in an environment,
that is, natural selection (DeBiasse and Kelly 2016). Conversely, phenotypic
plasticity involves a change in phenotype in response to the environment 
within an organism and within a single generation (DeBiasse and Kelly 
2016). Historically, the influence of the environment on the genome was 
assigned primarily to natural selection or gene-by-environment (GxE) 
interactions. A GxE interaction is when two genotypes respond to the same 
environment differently. Recently, however, the response of a single genotype 
in different environments has been attributed a more dynamic role (Fig. 2). For 
example, given the same genotype, an individual’s phenotype can be a result 
of the environment it is exposed to from development through adulthood 
(Fusco and Minelli 2010); this is phenotypic plasticity. 

Phenotypic plasticity comes with a cost, as species have to maintain the 
genetic and cellular mechanisms required to produce plastic responses, e.g., 
regulatory genes and/or enzymes (Scheiner 1993). Additionally, there could
be costs associated with information gathering about environmental conditions 
and genetic costs involving linked genes, pleiotropy, and epistatic interactions 
(DeWitt et al. 1998). Therefore, plasticity is not likely to evolve in a population
that does not experience a variety of environments or does not possess the
genetic variance required for a plastic response in that specific trait. When
environmental heterogeneity is fine-grained, individuals will experience various
environmental conditions during their lifespan, requiring an increased
acclimation capacity, and therefore phenotypic plasticity is more likely to
evolve (Storz et al. 2010). However, when environmental heterogeneity is 
coarse-grained, individuals experience a limited range of environmental 
conditions, and therefore phenotypic plasticity is less likely to evolve 
(Storz et al. 2010). 

When an individual’s phenotype is altered in the direction of the
local optimum, plasticity is said to be adaptive. An adaptive plastic response
can later become genetically encoded via natural selection through a process
known as genetic assimilation (Ghalambor et al. 2015). This process allows
an organism exposed to a constant stressor to eventually develop a biologically
robust response in the direction of its previously plastic response
(DeBiasse and Kelly 2016). 


2.6.2 Approaches for Assessing Response to Climate Change 

The genomic basis of response to climate change can be studied via multiple 
approaches (Table 2). It can also be studied on multiple levels: genomic,
transcriptomic, and epigenomic. Reduced representation methods, such as RAD-seq
(Baird et al. 2008), and sequence capture provide genomic information, whereas 
RNA-seq and reduced representation bisulfite sequencing (RRBS; Meissner et al. 
2005) are used for transcriptomics and epigenomics, respectively. Multilevel 
approaches are important in parsing the relative effects of adaptation and acclimation 
or genotypic specialization and phenotypic plasticity. 

In populations where candidate genes are responsible for most of the variation in 
potentially evolving traits, sequence capture can be employed to isolate and enrich 
those genes for analysis. This approach can be used to identify genotypic specialization
if there is a priori knowledge of the species’ genome. A classic example of
this method is the usage of exome capture to identify the genetic basis of high-elevation
adaptation in Tibetans (Yi et al. 2010). Long-term study of a single population
allows researchers to differentiate between genetic and environmental contributions
to trait change. Using whole-genome resequencing or reduced representation
genomics, researchers can look for signatures of selection that correspond to climate-related
phenotypic changes. This method, while effective, requires studying a population
over multiple years, which is not always possible.

Common garden experiments and transplant experiments offer a method by 
which one can tease apart the roles of local adaptation and plasticity in adaptive 
success. In a single generation, RNA-seq and RRBS can be used to identify plastic 
responses to environmental change. In a multi-generation experiment, researchers 
can identify signatures of selection and heritability of epigenetic modifications. For 
example, to test for plasticity in high and low elevation populations of the rufous-collared
sparrow (Zonotrichia capensis), Cheviron et al. (2008) transplanted individuals
to a low elevation common garden. None of the transcripts that were
differentially expressed at native elevations remained different in the common
garden, demonstrating a considerable level of phenotypic plasticity in cold and
hypoxia response (Cheviron et al. 2008).

Table 2 Next-generation approaches for assessing the genetic basis of response to climate change
in wild avian populations (Adapted from Hoffman and Sgro 2011)

| Approach | Advantages | Limitations | Next-gen sequencing method |
| --- | --- | --- | --- |
| Assessing variation in candidate genes for relevant traits | Useful when candidate genes are responsible for most of the variation in relevant trait | Requires a priori knowledge of both the loci and the mechanism of adaptation; useful only if few genes are responsible for most of the variation in the relevant trait | Sequence capture |
| Long-term study of a single population | Useful in long-term study populations; can parse out genetic and environmental contributions | Time-consuming because it requires repeated sampling over multiple years/seasons | Sequence capture; RAD-seq; genome resequencing |
| Common garden and transplant experiments across a climatic gradient (single generation) | Can differentiate between genotypic specialization and phenotypic plasticity | Not all species can be transplanted and/or live in captivity | RNA-seq; RRBS |
| Common garden and transplant experiments across a climatic gradient (multi-generation) | Can identify signatures of selection | Not all species can be transplanted and/or live in captivity | Genome resequencing; RAD-seq; RNA-seq; RRBS |
| Experimental evolution in artificial environments designed to simulate climate change | Can differentiate between genotypic specialization and phenotypic plasticity; can identify signatures of selection | Not all species can be transplanted and/or live in captivity; difficult to extrapolate to wild populations experiencing compounding stressors | Genome resequencing; RAD-seq; RNA-seq; RRBS |

While common garden and transplant
experiments yield valuable insights into population response to stressors, they are 
limited to species that can be transplanted and survive in a captive environment. 
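
The logic of the common garden comparison described above can be sketched as a simple set operation: transcripts that differ between populations at their native sites but no longer differ in the common garden are candidates for plastic regulation, while those that remain different are candidates for fixed (genetic) differences. This is only an illustrative sketch; the function name and transcript identifiers are hypothetical, and a real analysis would first require statistical testing for differential expression, which is not shown here.

```python
from __future__ import annotations


def partition_transcripts(native_de: set[str], garden_de: set[str]) -> dict[str, set[str]]:
    """Split transcripts that differ between populations at native sites into
    those that stop differing in the common garden (candidate plastic responses)
    and those that keep differing (candidate fixed differences)."""
    return {
        "plastic": native_de - garden_de,
        "persistent": native_de & garden_de,
    }


# Hypothetical transcript identifiers (made up for illustration).
differs_at_native_sites = {"transcript_1", "transcript_2", "transcript_3"}
differs_in_common_garden: set[str] = set()   # no difference persists

result = partition_transcripts(differs_at_native_sites, differs_in_common_garden)
print(sorted(result["plastic"]))
print(sorted(result["persistent"]))
```

An empty "persistent" set corresponds to the pattern reported for the rufous-collared sparrow, where none of the expression differences seen at native elevations remained in the common garden.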

Experimental evolution in artificial environments designed to simulate climate 
change provides a unique opportunity to study adaptation and acclimation to a 
controlled stressor. For example, wild guinea pigs that were exposed to heat 
treatments showed epigenetic modifications that were passed on to their offspring, 
providing them with improved resilience to environmental temperature increase 
(Weyrich et al. 2016). Simulated environment experiments allow researchers to 
study response along a single axis; however, whether these data can be extrapolated
to wild populations experiencing compounding stressors remains to be seen. 


2.6.3 Future Directions 

Studying the genomic basis of population response to climate change is a relatively 
new field in ornithology. Future studies that use transplant and/or simulated 
experiments and next-generation sequencing are needed to understand the relative 
roles of adaptation and acclimation in response to climate change. Additionally, 
research on rapid evolution in invasive species could provide some insight into how 
species will respond to climate change in the future (Moran and Alexander 2014; 
Chown et al. 2015). Nonetheless, the function of many avian genes is still unknown, 
particularly for genes with multiple functions and small effect sizes. Studies that 
work to characterize gene function across avian taxa are greatly needed. Still, as 
sequencing becomes more cost-efficient, multilevel studies that compare genomics, 
transcriptomics, and epigenomics of a population will become possible, thereby 
creating a more complete picture of the avian evolutionary response to climate 
change. 


3 General Conclusions 

The advent of genomics and next-generation sequencing methods has enabled 
significant advances in our understanding of avian conservation. It has the potential 
to facilitate the preservation of adaptive diversity—particularly as technology 
progresses and our understanding of gene function improves. In combination with 
new computational approaches, we have gained incredible power to reconstruct 
population and evolutionary histories of endangered species and to define units for 
conservation. In addition, new high-throughput methods increase our ability to learn 
about the biology of endangered species often without handling or disturbing them 
(such as noninvasive analyses of population size, inbreeding, movements, diet, and 
disease). Some challenges remain, most notably that we lack detailed information 
about gene function across taxa. Nonetheless, the increasing availability and 
decreasing cost of obtaining large-scale genomic data are resulting in a burgeoning 
number of studies that address this shortcoming, and our understanding of gene 
function stands to improve dramatically in the next few years. Our ability to harness 
the power of genomics will be amplified by using multiple genomic approaches to 
creatively answer pressing questions in avian conservation. 
Jurassic Park: What Did the Genomes of Dinosaurs Look Like?

Abstract

Recent palaeontological evidence is clear that birds are extant dinosaurs. Evolving 
along the lineage Diapsida—Archelosauria—Archosauria—Avemetatarsalia— 
Dinosauria—Ornithoscelida—Theropoda—Maniraptora—Avialae, birds are the 
latest example of dinosaurs emerging from catastrophic extinction events as 
speciose and diverse. Indeed, rather than being wiped out by the Cretaceous-Paleogene
meteor strike, they are the survivors of at least three extinction events.
Dinosaurs capture the public imagination through art, literature, television and film, 
most recently through the Jurassic Park/World franchise. Claims in the scientific 
literature of isolating dinosaur DNA (from amber-preserved insects or elsewhere) 
have largely been debunked. Nonetheless, the overall structure of dinosaur 
genomes along the above lineage can be determined by inference from 
chromosome-level genome assemblies. Our work focused first on determining 
the likely karyotype of the avian ancestor (probably a small, bipedal, feathered, 
terrestrial Jurassic dinosaur) finding great similarity to the chicken. We then 
progressed to determining the likely karyotype of the diapsid ancestor and the 
changes that have occurred to form extant animals. A combination of bioinformatics
and molecular cytogenetics indicates considerable interchromosomal rearrangement
from a “lizard-like” karyotype of 2n = 36-46 to one similar to that of certain
turtles from 275 to 255 million years ago (mya). Remarkable karyotypic similarities
between some turtles and chicken suggest identity by descent, in other words that,
aside from ~7 fissions, there were few interchromosomal changes from the
archelosaur (bird-turtle) ancestor to the Avemetatarsalia (dinosaurs and pterosaurs),
through the theropods to modern birds. Indeed, a similar rate of change beyond
255 mya would have meant that the avian-like karyotype was in place about
240 mya when the first dinosaurs and pterosaurs emerged. We mapped
49 intrachromosomal changes in the intervening period, finding significant gene 
ontology enrichment in homologous synteny block and evolutionary breakpoint 
regions. The avian-like karyotype with its many chromosomes provides the substrate
for variation (the driver of natural selection) through increased random
segregation and recombination. It thus may impact on the ability of dinosaurs to 
survive and thrive, despite multiple extinction events. 
This is a book about dinosaur genomes. Everything you’ve read so far represents an
exposition and analysis of dinosaur genomics. Birds did not evolve from dinosaurs; 
birds are not related to dinosaurs. On the contrary, the latest palaeontological 
evidence is very clear that birds are dinosaurs. Dinosaurs have been ever present 
in popular culture and the creative arts since the earliest fossil discoveries, promoted 
through film, television, press, art and literature. And, rather than being a group of 
animals that were wiped out by a very well-known extinction event caused by a 
meteor, they are in fact the survivors of several extinction events. Each time they 
emerge more diverse and are probably more likely to survive whatever each extinction
event might throw at them. In a recent study, we suggest this may be due, in part,
to their unique genomic structure, i.e. their karyotype. 


1 What Are Dinosaurs and Where Do They Fit in Vertebrate 
Evolution? 

Amniota is the vertebrate clade that includes mammals, birds and non-avian reptiles. 
It is a remarkably diverse group of terrestrial tetrapods believed to have shared a 
common ancestor 325 mya (million years ago) in the Permian period (Shedlock and 
Edwards 2009). Amniotes diverged into the lineages that ultimately became 
mammals (Synapsida) and reptiles (Diapsida) during this time. Diapsida has 
~17,500 extant members (~10,500 of which are birds) and is subdivided into two
groups (Fig. 1). Lizards, snakes and tuataras form a monophyletic group— 
Lepidosauromorpha—that is the sister group to the Archelosauria (crocodilians, 
dinosaurs, pterosaurs, turtles, birds), all of which shared a common ancestor 
275 mya (Hedges et al. 2015; Shedlock and Edwards 2009). The stem groups of 
Lepidosauria and Archelosauria also include several extinct lineages that existed in
the Triassic period including the rhynchosaurs (Ezcurra et al. 2014).

Fig. 1 The lineage from the diapsid ancestor to modern birds including the major extinction events

Of the
lepidosaurs, the tuatara diverged 272 mya making them an extraordinarily ancient 
species and the only extant example of its order, the Rhynchocephalia (Rauhut et al. 
2012). Assuming that the majority of recent molecular phylogenies of amniote 
interrelationships are correct, turtles (Testudines) diverged from Archelosauria first 
(around 255 mya), forming the crown-group Archosauromorpha (crocodilians, 
pterosaurs and dinosaurs, including birds). Prolonged debate previously surrounded 
the phylogenetic relationship of turtles (Testudines). The lack of temporal fenestrae 
in the skull led for many years to their assumed placement as a primitive anapsid; 
however, molecular evidence now places the Testudines as a sister group to the 
Archosauromorpha (Chiari et al. 2012; Crawford et al. 2012; Shaffer et al. 2013). 
The earliest recorded Testudine fossil (Odontochelys semitestacea) is Late Triassic 
in age (237-223 mya) (Benton et al. 2015; Nicholson et al. 2015) although stem 
turtle species (Eunotosaurus africanus) have been dated even further back to
260 mya (Lyson et al. 2010). It is reasonably well established therefore that the 
Testudine divergence occurred around 255 mya (Fig. 1). 

Archosauromorpha exhibits a basal split into the crocodile-line (Pseudosuchia or 
Crurotarsi) and dinosaur (bird)-line (Ornithodira or Avemetatarsalia) clades. The
group of animals most commonly known as dinosaurs is formally defined as the
clade including Triceratops, Passer (songbirds) and all of the descendants of their
common ancestor, with birds nested within the theropod clade Maniraptora (Fig. 1). 
Previous research dated the oldest unequivocal dinosaur fossils to 230 mya 
(Martinez et al. 2011); however, recent fossil finds now indicate that, although 
reported divergence times of major clades vary between different studies, the earliest 
dinosaur divergence from non-dinosaur dinosauromorphs was approximately 
240-245 mya (Nesbitt et al. 2013). Pterosauria, the sister group to Dinosauromorpha,
diverged around 245 mya, although an earlier origin is possible. The ornithodiran-crurotarsan
(bird-crocodilian) divergence occurred in the Lower Triassic period, or
potentially earlier around the time of the Permo-Triassic boundary 252 mya, despite 
older studies dating the crocodilian split from birds to 219 mya (Shedlock and 
Edwards 2009). Species numbers appear to have remained relatively low in both 
the lepidosaurs and the archosauromorphs until the Permo-Triassic mass extinction 
event (PTME) devastated the synapsids around 251 mya (Benton and Twitchett
2003). Massive volcanic eruptions in the Siberian Traps are thought to have initiated 
the conditions that created the PTME. These eruptions led to a prolonged period of 
global warming, creating anoxic conditions that devastated 80-90% of life on land 
and in the oceans (Benton and Twitchett 2003). The subsequent period of climate 
change led to increasingly arid conditions that marked out the beginning of the 
Triassic as a period of extraordinary ecological change. Estimates indicate that it 
took 10-15 million years before ocean reefs, forests and vertebrates were 
re-established after the PTME (Benton et al. 2013). 


2 Dinosaurs Were (and Are) Speciose and Abundant 

In terms of species diversity and abundance, dinosaurs were still relatively low in 
number over the first 30 million years of their evolution but, by the mid Jurassic, 
began to increase vastly in abundance, geographical spread and body size (Benton 
et al. 2014). The following 135 million years are remarkable not only as the period
when dinosaurs were the dominant vertebrates but also as a time
when they displayed an extraordinary range of species diversity. Throughout this
period the dinosaurs survived further extinction events including the Carnian-Norian
extinction event (CNEE) 228 mya that saw the end of the rhynchosaurs and dicynodonts (Brusatte
et al. 2008) and the End-Triassic mass extinction event (ETME) 201 mya that also 
devastated the Pseudosuchia (or Crurotarsi—the crocodilian ancestors), leaving only 
23 extant species. The period after the ETME corresponded with a steady increase in 
dinosaur diversity and abundance arguing against the widely held belief that the 
release of an ecological niche by the extinction of their competitors led to a surge in 
dinosaur disparity (Brusatte et al. 2008). There are now over 1000 known species of 
dinosaur that appear in the fossil record with around 30 more being identified each 
year particularly in regions rich in newly discovered fossils such as China. They 
were significantly decimated by the Cretaceous-Paleogene (K-Pg) extinction event 
66 mya but emerged again as modern birds, with over 10,000 phenotypically diverse
representatives. 

The extraordinary species diversity and abundance seen in the dinosaurs is often 
attributed to the eradication of competitor species that allowed the dinosaurs to 
flourish. However, it has also been suggested that these high levels of diversity 
and abundance reflect adaptations unique to dinosaurs that enabled them to survive 
through such harsh conditions, while other species perished. For example, the 
extraordinary growth rates evidenced by bone growth patterns along with highly 
adapted respiration systems such as pneumatised bones (Farmer and Sanders 2010) 
and unidirectional respiration are both considered to be key features that enabled the 
dinosaurs to flourish (O’Connor and Claessens 2005). Interestingly, these very 
adaptations that may have led to the success of the dinosaurs are also key features 
that contribute to the success of birds. 

There is now little doubt that modern birds are the latest in a long line of dinosaurs
with fossil evidence showing that both groups shared features such as feathers,
oviparity, brooding behaviours and skeletal similarities (Varricchio et al. 2008). The
question thus arises whether non-avian dinosaurs also exhibited distinctively avian 
features at a genetic level. Originating around 150 mya in the late Jurassic, birds
evolved from a theropod lineage (Chiappe and Dyke 2002) at a time when the 
supercontinent Pangaea was separated into two landforms—Laurasia and 
Gondwana. The fossil Archaeopteryx lithographica dating back to 150 mya and 
found in the nineteenth century in late Jurassic limestone in Germany (Meyer 1861) 
provides evidence of a transitional species between the non-avian and avian 
dinosaurs. Although previously considered to be the fossil representative of an
early modern bird, features such as a bony tail and teeth rule A. lithographica out
from being considered a true avian ancestor (Mayr et al. 2007). The oldest
unambiguous fossil representative of modern birds, Vegavis, is an aquatic bird
classified within Anseriformes and most closely related to Anatidae (ducks, geese
and swans) (Clarke et al. 2005). Dating back to ~67 mya, the discovery of this fossil
supports the notion that representatives of modern birds were co-extant with
non-avian dinosaurs prior to the Cretaceous-Paleogene (K-Pg) boundary 66 mya 
(Clarke et al. 2005). The inherent difficulties in fossil dating due to geographic and 
depositional sampling bias, however, have led to much controversy in the field of 
palaeontology (Chiappe and Dyke 2002), meaning that analyses at a genomic level 
are a useful complement to a fossil record that could represent actual avian ancestors 
imperfectly. Interestingly, the dinosaur ancestor of birds is generally considered to 
be a bipedal, terrestrial, relatively small (small size being a pre-adaptation to flight) 
Jurassic dinosaur with limited flying ability, not dissimilar to the Galliformes 
(Witmer 2002). 

Until the publication in 2014 of a revised avian phylogeny, the timing of avian 
diversification has been a subject of much debate (Jarvis et al. 2014). The first avian 
divergence is now considered to have taken place around 100 mya when the 
Palaeognathae (ratites and Tinamous) diverged from the Neognathae (Galloanseres 
and Neoaves). Within the Palaeognathae, the ratites and Tinamous then diverged 
84 mya, while the Neognathae diverged into its stem lineages, the Galloanseres and 
Neoaves, 88 mya. The Galloanserae divergence into the Galliformes (landfowl) and 
Anseriformes (waterfowl) occurred around the time of the K-Pg. The major 
divergences of the Neoaves into Columbea and Passerea are now dated to before 
the K-Pg boundary (67-69 mya). The rest of the divergences within Neoaves were 
largely complete at the ordinal level by 50 mya with the Passeriformes basal split 
estimated to be approximately 39 mya (Jarvis et al. 2014). The K-Pg event was of 
course a period of abrupt, mass global extinction and extreme climate change 
coinciding with the Chicxulub asteroid impact in Mexico (Schulte et al. 2010). It 
was a significant event for archaic birds (Ornithurae), of which modern birds are
descendants. Recent fossil evidence points to a major radiation of advanced
ornithurines occurring prior to the end of the Cretaceous period. The same group
then suffered an abrupt extinction around the K-Pg event with their disappearance
from the fossil record from the Paleogene period onwards (Longrich et al. 2011).
Data from the Jarvis et al. (2014) analysis also suggest that the K-Pg transition
period was one of rapid neornithine speciation with 36 lineages radiating over a
period of 10-15 million years. Genomic studies have therefore, in recent years,
revolutionised our understanding of avian genomics and its relationship to phenotype
and diversity. Genomic structure and organisation can be studied from a
number of perspectives, e.g. genome size, karyotype and nuclear organisation, and 
what the actual overall genomic structure of dinosaurs looked like therefore warrants 
further investigation. 


3 Genome Size Calculated from Direct Fossil Evidence 
and Why There Will Be No Jurassic Park 

When Jurassic Park hit the screens in 1993 (following, in one of the authors’ 
opinion, the even better Michael Crichton novel in 1990), it was a cultural phenomenon
and an outstanding story. It was of course just fiction. Finding reasonably
credible (albeit subsequently disproved) claims of isolating dinosaur DNA is difficult
(there are a lot of clearly bogus claims that do not warrant mention here), but
they include Scott Woodward and colleagues who isolated DNA molecules from 
two ancient bone fragments and produced nine readable sequences from a strand of 
DNA for a particular gene (Woodward et al. 1994). This was, we believe, the first 
report an authoritative journal has published about an apparent success in isolating 
what is presumably dinosaur DNA. In the event, however, the fragments were too 
small to be identified unequivocally as dinosaurian and were probably mammalian 
(and, if so, represented human contamination) (Gibbons 1994; Hedges et al. 1995; 
Zischler et al. 1995; Allard et al. 1995). In a reanalysis of DNA sequence data from a 
dinosaur egg fossil from Xixia, Henan Province, China, Wang and colleagues 
claimed to identify the real source of two putative dinosaur 18S rDNA sequences 
(Wang 1996) but subsequently found them to be of fungal and plant origin (Wang 
et al. 1997). Despite further claims (DeSalle et al. 1992; Cano et al. 1992), it seems 
that the preservation of DNA from dinosaurs is not likely to be possible, and 
therefore attempts to study the dinosaur genome directly are unlikely (Penney 
et al. 2013). In Jurassic Park, dinosaur DNA is extracted from the blood isolated 
from amber-preserved remains of ancient blood-sucking insects. In reality, although
it may be theoretically possible to extract the DNA of the insect itself (there are
uncorroborated claims of this), any dinosaur DNA is likely to have quickly 
degraded. 

Nevertheless, using osteocyte size as an indicator of genome size from fossilised 
bones, Organ et al. (2007) were able to identify a clear distinction between the small, 
characteristically avian genomes identified in the saurischian-theropod dinosaur 
lineage (that gave rise to modern birds) and the larger inferred genome seen in the 
ornithischian dinosaur lineage (which includes the stegosaurs and
Triceratops) that is comparable in size to those of modern crocodiles. They provided
evidence that saurischian dinosaur genomes averaged around 1.78 pg compared to
ornithischian dinosaur genomes of about 2.49 pg. Their findings suggest that many
of the genomic features widely considered to be adaptations to the metabolic
demands of flight in birds, such as small genome size and low repeat content,
evolved between 250 and 230 mya and, in this lineage, have changed
little since (Organ et al. 2007). These results were initially interpreted as supporting 
the hypothesis that small genome size and low repeat content were a genomic 
exaptation that preceded and facilitated the endothermic metabolic demands of 
birds, e.g. for flight (Hughes and Hughes 1995). It was further hypothesised that 
the avian karyotype evolved in response to a reduction in genome size in birds 
(Hughes and Hughes 1995). This theory was subsequently challenged by a study that 
suggested that a decline in overall genome size occurred in non-volant dinosaurs 
(Organ et al. 2007). As we will see in subsequent sections, however, we have 
evidence that an avian-like karyotype emerged well before the origin of flight and 
was independent of genome size reduction. 
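
To put the genome sizes quoted above into more familiar units, picograms of DNA can be converted to base pairs using the standard factor of 1 pg ≈ 0.978 × 10^9 bp. The sketch below assumes that the 1.78 pg and 2.49 pg figures are haploid genome sizes (C-values); the constant and function names are our own and purely illustrative.

```python
# Standard conversion: 1 pg of double-stranded DNA corresponds to ~0.978e9 base pairs.
BP_PER_PG = 0.978e9


def c_value_to_gb(picograms: float) -> float:
    """Convert a genome size in picograms to gigabase pairs (Gb)."""
    return picograms * BP_PER_PG / 1e9


# Average inferred genome sizes quoted in the text (assumed to be haploid C-values).
saurischian_pg = 1.78
ornithischian_pg = 2.49

saurischian_gb = c_value_to_gb(saurischian_pg)
ornithischian_gb = c_value_to_gb(ornithischian_pg)

print(f"Saurischian:   {saurischian_gb:.2f} Gb")
print(f"Ornithischian: {ornithischian_gb:.2f} Gb")
print(f"Ratio:         {ornithischian_pg / saurischian_pg:.2f}")
```

On these figures the inferred ornithischian genome is roughly 40% larger than the saurischian one.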


4 Karyotype Formation and Evolution in Dinosaurs 

In the absence of cellular material or even relatively intact DNA molecules (from 
amber or any other biological source), data from genome sequence assemblies of 
extant species provide us with the ability to reconstruct karyotypic structures of 
extinct lineages by inference. We can do this on the proviso that genomes are 
assembled at, or close to, chromosome level, i.e. one contiguous length of sequence 
per chromosome (Zhang et al. 2014). In a study that coincided with the publication 
of the multiple avian genomics and phylogeny papers (Jarvis et al. 2014; Zhang et al. 
2014), we analysed (near) chromosome-level assemblies from six living birds. Using 
an Anolis carolinensis lizard outgroup, we inferred the most likely ancestral karyotype
of all birds for the macrochromosomes and, because outgroup chromosome-level
assemblies were relatively sparse for the smaller chromosomes, for the
Neognathae ancestor for the microchromosomes. We then went on to reconstruct
the most likely sequence of events that led to contemporary karyotypes in birds. We
provided evidence that the chicken (Gallus gallus) was the closest karyotypically to
the reconstructed ancestral pattern, with budgerigar (Melopsittacus undulatus) and
zebra finch (Taeniopygia guttata) experiencing the greatest number of inter- and
intrachromosomal rearrangements, respectively (Romanov et al. 2014). 

Last year we (O’Connor et al. 2018) applied a similar approach to that of 5 years 
ago (Romanov et al. 2014) to recreate the most likely ancestral karyotype of 
Diapsida. We used both the Multiple Genome Rearrangements and Ancestors
(MGRA2) tool and molecular cytogenetics. That is, focusing our attention on the
best-quality chromosome-level assemblies of avian [chicken (G. gallus) (Hillier
et al. 2004; Warren et al. 2017), mallard (Anas platyrhynchos) (Rao et al. 2012),
zebra finch (T. guttata) (Warren et al. 2010)], reptilian [Carolina anole lizard
(A. carolinensis) (Alfoldi et al. 2011)] and mammalian [grey short-tailed opossum
(Monodelphis domestica) (Mikkelsen et al. 2007)] genomes, we supplemented
bioinformatic data with novel molecular cytogenetic approaches including a universally
hybridising BAC probe set (Damas et al. 2017). Indeed, one of the principal
technical advances that made this study possible was the identification of a probe set
that would hybridise directly across species that diverged hundreds of millions of
years ago (O’Connor et al. 2018). We chose these species as they have robust
chromosome-level assembled genomes and because M. domestica has a karyotypic 
structure thought to resemble the mammalian ancestor most closely (Mikkelsen et al. 
2007). Among other genome assemblies that might have proven useful in our 
analyses, those generated from alligators and turtles (St John et al. 2012; Shaffer 
et al. 2013) were discounted as they are too fragmented, i.e. not close to chromosome 
level. Also, near chromosome-level assemblies for turkey, budgerigar and ostrich 
were ultimately excluded because our studies in these species (not shown) revealed 
that the level of fragmentation in these genomes had the potential to introduce false 
evolutionary breakpoint regions. Finally, we discounted crocodilians, partly because 
of a relative lack of FISH success after multiple attempts and partly because, in any 
event, none of the crocodilian species studied has a typical archelosaur karyotype,
i.e. they have no microchromosomes (Srikulnath et al. 2015). The BACs that we isolated gave
strong hybridisation signals on two turtles, Trachemys scripta (red-eared slider) and
Apalone spinifera (spiny soft-shelled turtle) and some signals on A. carolinensis 
metaphases. Although these two turtles do not have chromosome-level assemblies, 
molecular cytogenetic analysis allowed us to anchor the series of events from the 
perspective of an archelosaur ancestor such as Eunotosaurus. A combination of 
molecular cytogenetics and bioinformatics therefore allowed us to recreate the inter- 
and intrachromosomal changes that occurred from the diapsid ancestor, to
the archelosaur (bird-turtle) ancestor (Benton et al. 2015), and thereafter through the 
theropod dinosaur lineage to birds. 
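
The idea of locating evolutionary breakpoint regions by comparing the order of conserved blocks between genomes can be illustrated with a toy example. The sketch below is not the MGRA2 algorithm and is not the pipeline used in the studies cited; it is a minimal illustration, under our own assumptions, in which each chromosome is a list of signed integers (hypothetical synteny block labels, with the sign giving orientation) and a breakpoint is an adjacency present in one genome but absent from the other.

```python
def adjacencies(order):
    """Return the set of orientation-aware adjacencies of one linear chromosome."""
    adj = set()
    for left, right in zip(order, order[1:]):
        # (a, b) read on one strand is the same adjacency as (-b, -a) on the other,
        # so store a single canonical form of the pair.
        adj.add(min((left, right), (-right, -left)))
    return adj


def genome_adjacencies(genome):
    """Collect adjacencies over all chromosomes of a genome."""
    result = set()
    for chromosome in genome:
        result |= adjacencies(chromosome)
    return result


def breakpoints(reference, target):
    """Adjacencies of the reference genome that are not conserved in the target."""
    return genome_adjacencies(reference) - genome_adjacencies(target)


# Hypothetical genomes built from six synteny blocks (labels are arbitrary).
ancestor = [[1, 2, 3, 4], [5, 6]]
after_fission = [[1, 2], [3, 4], [5, 6]]      # chromosome 1 split between blocks 2 and 3
after_inversion = [[1, -3, -2, 4], [5, 6]]    # blocks 2-3 inverted within chromosome 1

print(len(breakpoints(ancestor, after_fission)))
print(len(breakpoints(ancestor, after_inversion)))
```

A fission disrupts a single adjacency and an inversion disrupts two, which is why fragmented assemblies, where contigs end at arbitrary positions, can introduce false breakpoints.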

Our data, and interpretations from it, provide strong evidence that most features 
that we now associate with a “typical avian karyotype” (see our previous chapter) 
were established before the Testudine divergence 255 mya. That is, zoo-FISH
[taking FISH probes isolated from one species and hybridising to the metaphases 
of another—in this case chicken probes (Damas et al. 2017)] was largely successful 
on the chromosomes of A. carolinensis and especially both turtles (Fig. 2). The data 
clearly indicate that most chicken (and by inference, ancestral avian) chromosomes
1-28 + Z are syntenic to those of the spiny soft-shelled turtle A. spinifera (2n = 66).
Exceptions included fusions of chromosome 4 in chicken and chromosome 22 in
turtles. Of the microchromosomal probes, 29 of the original 36 (81%) worked
successfully on chromosomes 10-15 and 17-28. Chromosomes 16 and 29-38
were not included because of the lack of sequence coverage. Analysing probes
that worked on all species revealed that the orthologues of chicken chromosomes 
12 and 13 were fused and chromosome 26 was attached to chromosome 4 in 
T. scripta and A. carolinensis but represented as separate microchromosomes in 
A. spinifera. The chromosome 22 orthologue appeared as a separate chromosome in 
A. carolinensis but as fused to the centre of a macrochromosome in T. scripta. Eight 
bird microchromosome orthologues (10, 11, 15, 17, 19, 21, 23 and 24) were also 
single microchromosomes in all three reptiles studied. To this list we added the 
orthologues of chromosomes 14, 18, 25 and 28 when considering turtles only and 
added chromosomes 12, 13 and 26 as single microchromosomes in A. spinifera 
alone. Hybridisation of some probes to the chromosomes of T. scripta (2n = 50) and 
A. carolinensis metaphases (2n = 36) revealed some chromosomes with 
microchromosomal homologues attached. 

Fig. 2 (a) FISH experiment for chicken chromosome probes derived from chromosome 
19 hybridising to Trachemys scripta (red-eared slider) turtle chromosomes. Avian and some turtle 
karyotypes are very similar. (b) FISH experiment for chicken chromosome probe derived from 
chromosome 26 on Anolis carolinensis (Carolina anole lizard) indicating a 
“proto-microchromosome” 

This means that they have either fused to 
macrochromosomes or, probably more likely, retained the ancestral state present in 
the diapsid ancestor. These we termed “proto-microchromosomes” as seen in 
Fig. 2b. Indeed, the A. carolinensis karyotype had some broad similarities to the 
diapsid ancestor established by the bioinformatic (MGRA) approach. Looking at the 
chromosomes of T. scripta (2n = 50) in isolation, there are broad similarities to birds 
but with more microchromosomal homologues attached to larger chromosomes than 
in A. spinifera. This would either indicate that T. scripta has a karyotype that 
represents an earlier stage of differentiation to the “bird-like” turtle pattern or that 
it subsequently underwent a series of fusions (such as in the crocodilians). The first 
of these two options is the most parsimonious as it requires fewer events. Moreover, 
the range of diploid numbers in turtles is 2n = 26-68; the majority are more 
“avian-like” than their other reptilian counterparts, inspiring us to further study turtles in 
order to shed increasing light on the evolution of the dinosaur pattern. Among 
macrochromosomes, chromosome painting data indicates that chromosomes 1-9 
+ Z are also all precise counterparts, with the avian W chromosome evolving after the 
Palaeognathae-Neognathae divergence since dimorphic sex chromosomes are absent 
from turtle and ratite karyotypes. Given that we are talking about an overall pattern 
involving multiple chromosomes, rather than individual events, convergence 
(homoplasy) seems unlikely. This leaves identity by descent as the only option, 
leading us to the inescapable conclusion that the “avian-like pattern” (at least 
as far as 2n = 66 with chromosomes syntenic to modern birds) was in place ~255 
mya. 

A narrative therefore emerges (Fig. 3) of a diapsid ancestral karyotype (~275 
mya) with a chromosome number of 2n = 36-46, of which roughly half would have been 
macro- and half microchromosomes (Beçak et al. 1964; Alfoldi et al. 2011). This is 
the same as that which has been previously reported (Organ et al. 2008) and is shared with 
most other reptiles apart from birds and turtles. 

Fig. 3 Overall evolution of chromosomes from the diapsid ancestor, to the archelosaur ancestor, to 
modern birds 

Rapid rearrangement appears to have 
then occurred over a period of 20 million years leading to a pattern similar to that 
seen in A. spinifera. We know this because Alfoldi et al. (2011) found direct synteny 
at the sequence level between the microchromosomes of A. carolinensis and 
G. gallus, with all but one lizard microchromosome corresponding to a single 
chicken microchromosome. Given that the A. carolinensis genome has 
12 microchromosomal pairs compared to the 28-30 seen in most birds, the most 
likely explanation is that these 12 were present in the diapsid ancestor, the remainder 
evolving thereafter by fission (Alfoldi et al. 2011) to at least 2n = 66 by 255 mya. 
These conclusions are also consistent with previous studies using chicken 
macrochromosome paints on Chinese soft-shelled turtle (Pelodiscus sinensis) 
(2n = 66) (Matsuda et al. 2005), T. scripta (Kasai et al. 2012) and the painted turtle 
(Chrysemys picta) chromosomes (both 2n = 50) (Badenhorst et al. 2015) which 
provide direct evidence that turtle and bird macrochromosomes are precise 
counterparts of one another. Since 255 mya, only ~7 fissions are sufficient to form 
the pattern that we see in ratites, Galliformes, Anseriformes, Columbea and Passerea 
(among other birds). Determining how soon these changes occurred and the modern 
bird karyotype (2n ≈ 80) was complete is difficult, particularly as the various 
crocodilian karyotypes with their many fused chromosomes do not help us. If a 
rate of fission similar to that which occurred from 275 to 255 mya carried on for another 
15 million years, a complete bird-like karyotype would have emerged before the 
appearance of the earliest dinosaurs and pterosaurs 240 mya (Baron et al. 2017). At 
the other extreme, a complete cessation of fission events 255 mya would indicate that 
the earliest dinosaur and pterosaur karyotypes were more similar to that of 
A. spinifera or P. sinensis. Either way, these two scenarios are very similar. David 
Burt (2002) suggested that most avian microchromosomes were present in the avian 
common ancestor >80 mya (Cracraft et al. 2015), arguing that it probably had the 
small genome size characteristic of birds and a karyotype of around 2n = 60. In the 
light of our recent data, we would disagree with aspects of this as we suggest that the 
karyotype came before the reduction in genome size (Burt 2002). Indeed Uno and 
colleagues (Uno et al. 2012) suggested that the archelosaur ancestor probably had 
microchromosomes like turtles. 
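
The timing argument above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis: it assumes that each fission adds exactly one chromosome pair and that no fusions occurred, and the function names are hypothetical. It uses only the diploid numbers quoted in the text (2n = 36-46 at ~275 mya, 2n = 66 at ~255 mya, 2n ≈ 80 in modern birds).

```python
# Illustrative arithmetic only; assumes each fission adds exactly one
# chromosome pair and that no fusions occurred (both are simplifications).

def pairs(diploid_number: int) -> int:
    """Number of chromosome pairs (haploid number n) for a given 2n."""
    return diploid_number // 2

def fissions_needed(start_2n: int, end_2n: int) -> int:
    """Net fissions required to go from one diploid number to another."""
    return pairs(end_2n) - pairs(start_2n)

def time_to_complete(ancestral_2n: int,
                     turtle_like_2n: int = 66,
                     bird_like_2n: int = 80,
                     early_interval_my: float = 275 - 255) -> float:
    """Million years needed to reach the bird-like karyotype if the
    275-255 mya fission rate had simply continued."""
    early_rate = fissions_needed(ancestral_2n, turtle_like_2n) / early_interval_my
    return fissions_needed(turtle_like_2n, bird_like_2n) / early_rate

for ancestral_2n in (36, 46):  # range of diapsid ancestral 2n given in the text
    print(f"2n = {ancestral_2n}: {time_to_complete(ancestral_2n):.1f} my")
```

Under these assumptions the remaining ~7 fissions would take roughly 9 to 14 million years, which fits within the 15 million years between 255 and 240 mya.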

It cannot be ignored, however, that there is an association between genomes with 
fewer chromosomes (and no microchromosomes) and larger genome sizes around 
2.5-3 Gb, as in mammals (Kapusta et al. 2017) and crocodilians (St John et al. 2012). 
More repetitive elements could provide substrates for interchromosomal rearrangement, 
which is commonplace in mammals but rare in birds, and it has been suggested 
that an avian karyotype provides fewer opportunities for interchromosomal 
rearrangement due to the existence of fewer recombination hotspots (despite a higher 
overall recombination rate) (Kawakami et al. 2014; Smeds et al. 2016), repeat 
structures (Gao et al. 2017; Warren et al. 2017; Mason et al. 2016), endogenous 
retroviruses (Cui et al. 2014; Farre et al. 2016; Romanov et al. 2014) and more 
conserved noncoding elements (Damas et al. 2017). Fission and intrachromosomal 
rearrangement are not precluded in this model, however. Returning to the discussion 
about the relationship between genome size, flight and karyotype therefore, the 
evidence suggests that the avian-like karyotype came first, followed by a reduction 
in genome size, followed by flight. Moreover, although flight evolution might be 
correlated with smaller genome size [consider pterosaurs vs. other avemetatarsalians 
(Organ and Shedlock 2009); bats vs. other mammals (Hughes and Hughes 1995); 
strong vs. weak flying/flightless birds (Gregory 2005)], other mechanisms are almost 
certainly involved. 


5 Intrachromosomal Rearrangement and the Role of Gene 
Ontology Analysis 

Our results suggest that, aside from ~7 fissions, the primary mechanism for 
chromosomal rearrangement in the avian stem lineage after 255 mya was intrachromosomal 
(most likely chromosome inversions). Using MGRA2, we generated 19 contiguous 
ancestral regions (CARs). To all intents and purposes, these represented the 
chromosomes of the diapsid ancestor. That is, while we could not be entirely certain 
that each CAR represented a whole chromosome, as MGRA2 will inherently 
“break” the chromosome if it cannot find synteny, a total of 19 diapsid ancestral 
CARs probably represented a similar, if slightly smaller, number of chromosomes. 
Reconstructed CARs, when compared to extant genomes, resulted in the 
identification of rearrangements between the diapsid ancestor and chicken (G. gallus) 
genomes. Doing this, we inferred 49 intrachromosomal (chromosome inversion) 
events. This, however, must be an underestimate due to the lack of sequence 
coverage in some areas, particularly the smallest bird microchromosomes. 
Determining rates of change is little more than educated guesswork, but there is some 
evidence of intrachromosomal change speeding up in modern birds, even in the 
chicken, which is thought to be very similar chromosomally to the avian common 
ancestor (Romanov et al. 2014). Increased intrachromosomal change has been 
reported in specific groups, with several studies suggesting that the greatest rates 
would be found in songbirds (Skinner and Griffin 2011; Farre et al. 2016; Zhang 
et al. 2014), the bird group with the most species. Similarly, bursts of speciation may 
have also been accompanied by increased rates of chromosome inversion in other 
dinosaur groups. 
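
To illustrate how comparing a reconstructed CAR with an extant chromosome can expose an inversion, the sketch below counts broken adjacencies between two signed orders of synteny blocks. This is a generic breakpoint comparison and not the MGRA2 algorithm; the block identifiers, function names and the toy example are all hypothetical.

```python
from typing import List, Set, Tuple

# A signed block id: +3 is block 3 in forward orientation, -3 is reversed.
Block = int

def adjacencies(order: List[Block]) -> Set[Tuple[Block, Block]]:
    """Orientation-aware adjacencies of a signed block order.

    Each neighbouring pair (a, b) is stored together with its
    reverse-strand equivalent (-b, -a), so that reading the chromosome
    from either end gives the same set.
    """
    adj: Set[Tuple[Block, Block]] = set()
    for a, b in zip(order, order[1:]):
        adj.add((a, b))
        adj.add((-b, -a))
    return adj

def breakpoints(ancestor: List[Block],
                descendant: List[Block]) -> List[Tuple[Block, Block]]:
    """Ancestral adjacencies that are no longer present in the descendant."""
    desc_adj = adjacencies(descendant)
    return [(a, b) for a, b in zip(ancestor, ancestor[1:])
            if (a, b) not in desc_adj]

# Toy example: blocks 2 and 3 of a hypothetical CAR are inverted in the
# descendant chromosome.
ancestral_car = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]
print(breakpoints(ancestral_car, inverted))  # [(1, 2), (3, 4)]
```

A single inversion disrupts the two adjacencies at its ends, which is why inversions are inferred from pairs of breakpoints flanking a block run in reversed order.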

In our recent study of dinosaur karyotypes, we found nearly 400 HSBs 
(homologous synteny blocks) and EBRs (evolutionary breakpoint regions) 
(O’Connor et al. 2018). The number of HSBs per CAR (“chromosome”) was 
between 2 and 59 depending on the chromosome. In total, 17 chicken chromosomes 
were aligned to these CARs, namely, chromosomes 1-8, 11-13, 15, 18, 24, 26, 
27 and Z; some microchromosomes could not be represented. Reconstructed CARs 
were then mapped to the extant genomes. From the diapsid ancestor to the chicken 
genome, 49 inversions were identified plus 10 interchromosomal changes including 
a translocation between the orthologues of chicken chromosomes 5 and 20, 
consistent with the FISH results. Combining this with the molecular cytogenetic (FISH) 
data adds to our existing narrative, suggesting that interchromosomal 
rearrangements formed the turtle-like (2n = 66) pattern 275-255 mya with most 
intrachromosomal rearrangements (inversions) occurring after. For simplicity of 
presentation in Fig. 3, all intrachromosomal changes are shown after formation of 
the basic archelosaur common ancestor pattern 255 mya, although we cannot 
preclude the possibility that some happened before. Between the diapsid and 
archelosaur common ancestors, a fusion most likely occurred to form chromosome 
1, and translocations/fissions occurred between avian ancestral CARs that became 
chromosome 7 (Fig. 3). 

Studies of the best assembled genomes also indicate that EBRs are located within 
gene-dense loci and are enriched with genes related to lineage-specific biology, 
transposable elements and other repetitive sequences (Farre et al. 2016; Damas et al. 
2017; Alfoldi et al. 2011; Hillier et al. 2004). Conversely, sequences that stay 
together during evolution (HSBs) are enriched for developmental genes and 
regulatory elements (Hillier et al. 2004). While random breakage during karyotype 
evolution (Nadeau and Taylor 1984) cannot be excluded completely, especially 
when the breaks have a neutral effect on phenotype, there is mounting evidence that the 
larger HSBs and selected EBRs (at least in animal genomes) are maintained 
nonrandomly (Pevzner and Tesler 2003; Larkin et al. 2003; Farre et al. 2016; 
Damas et al. 2017). There are certainly regions more prone to breakage (e.g. in 
recombination hotspots or open chromatin areas), and those chromosome breaks not 
disturbing essential genes or providing a selective advantage are more likely to be 
fixed in populations, thereby becoming EBRs (Farre et al. 2016). In other words, 
chromosome rearrangement may serve a functional purpose. 

When we examined GO (gene ontology) terms in HSBs, significant enrichments 
were observed for those relevant to amino acid transmembrane transport and 
signalling as well as synapse/neurotransmitter transport, nucleoside metabolism and use, 
cell morphogenesis and cytoskeleton and sensory organ development. HSBs are 
often enriched for GO terms related to phenotypic features that remain constant 
(Larkin et al. 2009), and the results presented here are consistent with this 
hypothesis. Sankoff (2009) however proposed that EBRs are where the “action” in genome 
evolution lies, and, previously, we found that GO terms in avian EBRs are associated 
with specific adaptive features, e.g. enrichment for forebrain development in the 
budgerigar EBRs (consistent with vocal learning) (Farre et al. 2016). Among the 
EBRs, we identified significant enrichments in genes and single GO terms relevant 
to chromatin modification and chromosome organisation as well as proteasome/ 
signalosome structure. Specifically, our first GO annotation cluster consisted of six 
genes related to proteasome/signalosome structure within EBRs located on six 
chicken chromosomes. The second annotation cluster included 15 genes relevant 
to chromatin modification within 12 EBRs on seven chicken chromosomes. These 
results illustrated some parallels to recent findings in rodents where chromosomal 
changes were associated with open chromatin (Capilla et al. 2016) and the majority 
of the 15 genes found in this GO term were related to control of gene expression. 
Transcription factors modify chromatin by making it accessible during transcription, 
and this is relevant because EBRs rearrange transcription factor genes. This might 
affect expression of other genes of the same pathway. It is interesting to note that two 
of these genes showed a different expression pattern when comparing birds and 
mammals: HDAC8 functioning in early embryo development (Murko 2010) and 
PRMT8 expressed in the brain (Wang et al. 2017). Our results thus suggest a 
correlative link between chromosomal and morphological changes among species, 
in some way mediated by rearranging genes controlling the expression of 
developmental pathways. 
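
The text does not state which statistical test underlies the GO enrichments. A common choice for this kind of analysis is a one-sided hypergeometric test, sketched below as an assumption rather than as the method actually used. The gene counts in the example are invented for illustration, apart from the 15 chromatin-modification genes mentioned above.

```python
from math import comb

def hypergeom_upper_tail(population: int, annotated: int,
                         sample: int, hits: int) -> float:
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical numbers: 1000 background genes, 50 of them annotated with a
# chromatin-modification term, 100 genes lying in EBRs, 15 of which carry
# the term (only the figure of 15 comes from the text).
p_value = hypergeom_upper_tail(population=1000, annotated=50, sample=100, hits=15)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice such p-values would also be corrected for the number of GO terms tested; that step is omitted here for brevity.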


6 Karyotypic Stability Leading to Phenotypic Diversity? 

The fact that this avian-like karyotype has remained quite unchanged for such a long 
evolutionary period suggests a manner of genome organisation that might have 
provided the raw materials for evolutionary success. Reasons for such success 
might, we have suggested, be due to its ability to generate variation, the basis of 
natural selection. That is, when a karyotype has many chromosomes, including 
microchromosomes with high recombination rates, this inherently generates greater 
variation through increased genetic recombination and increased random 
chromosome segregation. David Burt (2002) also suggested that a higher recombination rate 
has contributed to the unique genomic features seen in microchromosomes such as 
high GC content, low repeat content and high gene density, which subsequently led 
to its maintenance. Phenotypic variation, in turn, promotes rapid adaptation and may 
therefore have contributed to the >10,500 species of birds and who knows how 
many species of non-avian dinosaurs. Of course, a karyotype with many tiny 
chromosomes is not the only means by which variation can be generated (genic, 
epigenetic and interchromosomal variation all come to mind), and amphibians have 
enormous phenotypic variation but few chromosomes. We nonetheless think that 
this may explain the apparent paradox of a group with very little interchromosomal 
change but incredible phenotypic diversity. Thus, while we have previously thought 
that the characteristic avian karyotype might help explain the differences between a 
hummingbird and an ostrich (and most things in between), we perhaps should 
broaden our minds and think that it might help explain the difference between a 
hummingbird and a tyrannosaur. 
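
The contribution of random chromosome segregation to variation can be made concrete with the textbook relationship that n chromosome pairs can assort into 2^n combinations in the gametes. The sketch below applies this to the diploid numbers discussed in this chapter; it deliberately ignores crossing over, so it understates the total variation, and the function name is hypothetical.

```python
def segregation_combinations(diploid_number: int) -> int:
    """Distinct chromosome assortments in gametes: 2**n for n pairs
    (independent assortment only; crossing over is ignored)."""
    return 2 ** (diploid_number // 2)

karyotypes = [
    ("diapsid-like, 2n = 36", 36),
    ("turtle-like, 2n = 66", 66),
    ("bird-like, 2n = 80", 80),
]
for label, two_n in karyotypes:
    print(f"{label}: {segregation_combinations(two_n):,} combinations")
```

Moving from 18 to 40 chromosome pairs multiplies the number of possible assortments by 2^22, roughly four million fold, before recombination within chromosomes is even considered.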

So, although there will be no Jurassic Park, or Jurassic World for that matter, we 
can reconstruct the overall likely karyotype of common ancestors inferring the 
sequence of events that led to living animals. Armed with that information, it 
becomes easy to suggest that if we had the opportunity to make metaphase 
chromosome preparations from the tissue of some of our favourite theropod dinosaurs 
(Tyrannosaurus and Velociraptor are both genera of the group), then karyotype 
and zoo-FISH results would reveal little difference from a modern chicken, pigeon, 
duck or ostrich. Of course, it is probably the case that there are some groups which 
underwent significant interchromosomal change: kingfishers (Christidis 1990) 
(many fissions), parrots (Nanda et al. 2007) and falcons (Damas et al. 2017) 
(many fusions) are modern examples of this, but it would be hard to guess which 
dinosaur groups they were. 

The discovery that the avian karyotype had deeper origins than previously 
thought echoes other recent discoveries about dinosaur morphology, 
demonstrating that features hitherto believed to be characteristic of crown-group 
birds only (e.g. feathers and pneumatised skeletons) arose first among more ancient 
dinosaur or archosaurian ancestors (Zhou 2004; Baron et al. 2017). “Everyone” 
loves dinosaurs. They were the dominant group of animals for nearly 200 million 
years; their radiations following two mass extinction events and, despite being 
almost wiped out by a third (the K-Pg meteor impact), their resilience as a highly 
diverse and speciose clade (extant birds) (Barrowclough et al. 2016) have fascinated 
scientists since they were first discovered. Of course, many of the evolutionary 
changes were in response to a rapidly changing environment, and many would 
have involved individual genes. A possible “next step” in our research therefore 
might be to infer individual genic sequences and their possible functions in 
individual animals. 

In conclusion, dinosaur karyotypes are not just descriptions (and inferred ones at 
that). The gross genome organisation and evolution of dinosaur chromosomes 
(including the success of modern birds) might have been a major factor contributing 
to their phenotypic diversity, unique physiology and ability to adapt (Berv and Field 
2018) and, ultimately, to survive. Jurassic Park surely would not have been quite so 
interesting. 